Research Article
Weapon Detection Using YOLO V3 for Smart Surveillance System
Received 4 March 2021; Revised 15 April 2021; Accepted 3 May 2021; Published 12 May 2021
Copyright © 2021 Sanam Narejo et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Every year, a large share of the world's population is affected by gun-related violence. In this work, we develop a
fully automated computer-based system to identify basic armaments, particularly handguns and rifles. Recent work in deep
learning and transfer learning has demonstrated significant progress in object detection and recognition. We
implemented the YOLO V3 ("You Only Look Once") object detection model by training it on our customized dataset. The
training results confirm that YOLO V3 outperforms YOLO V2 and a traditional convolutional neural network (CNN).
Additionally, intensive GPUs or high computation resources were not required in our approach, as we used transfer
learning to train our model. By applying this model in our surveillance system, we aim to save human lives and reduce
the rate of manslaughter and mass killing. Our proposed system can also be implemented in high-end surveillance and
security robots to detect a weapon or unsafe assets and so avoid any kind of assault or risk to human life.
Recent developments indicate that machine learning [3–6] and advanced image processing algorithms have played a
dominant role in smart surveillance and security systems [7, 8]. In addition, the popularity of smart devices and
networked cameras has empowered this domain. However, detection and tracking of human objects or weapons are still
conducted at cloud centers, as real-time, online tracking is computationally costly. Significant efforts have been
made in recent years to monitor robot manipulators that need high control performance in reliability and speed
[9, 10]. The researchers in [11] attempted to improve the response characteristics of the robotic system and to
attenuate the uncertainties. Their robust model-free controller incorporates time delay control (TDC) and adaptive
terminal sliding mode control (ATSMC) methods.

In this research work, we aim to develop a smart surveillance security system that detects weapons, specifically
guns. For this purpose, we have applied a few computer vision methods and deep learning to identify a weapon in a
captured image. Recent work in machine learning and deep learning, particularly convolutional neural networks, has
shown considerable progress in object detection and recognition, especially in images. As the first step of any video
surveillance application, object detection and classification are essential for further object tracking tasks. For
this purpose, we trained the YOLO v3 ("You Only Look Once") model [12, 13], a state-of-the-art real-time object
detector. Furthermore, we not only detect guns, rifles, and fire but also obtain the location of the incident and
store the data for future use. As a demonstration of a real-life scenario, we connected three systems using socket
programming: a camera, a CCTV operator, and security panels.

This work is an attempt to design and develop a system that can detect guns, rifles, and fire quickly and with few
computational resources. It is evident from technological advancements that most human-assisted applications are now
automated and computer-based. Eventually, these computer-based systems will be replaced by smarter machines, robots,
or humanoid robots. To give robots a sense of vision, object detection plays a fundamental part in understanding and
interpreting objects. Thus, our proposed system can also be implemented in surveillance and security robots to detect
any weapon or unsafe assets.

2. Literature Review

Reducing life-threatening acts and providing high security are challenging everywhere. Therefore, a number of
researchers have contributed to monitoring various activities and behaviors using object detection. In general, the
framework of a smart surveillance system is developed on three levels: firstly, to extract low-level information such
as feature engineering and object tracking; secondly, to identify unusual human activities, behavior, or the presence
of any weapon; and finally, the high level concerns decision making, such as abnormal event detection or any anomaly.
The latest anomaly detection techniques can be divided into two groups: object-centered techniques and integrated
methods. The convolutional neural network (CNN) spatial-temporal system is applied only to spatial-temporal volumes of
interest (SVOI), which reduces the cost of processing. For surveillance videos of complex scenes, researchers in [14]
proposed a tool for detecting and locating anomalous activities. Through spatial-temporal convolution layers, this
architecture captures objects from both the time domain and the frequency domain, thereby extracting both the presence
and motion data encoded in continuous frames. To be robust to local noise and improve detection precision,
spatial-temporal convolution layers are applied only within spatial-temporal volumes of changing pixels. Researchers
have also proposed an anomaly-introduced learning method for detecting anomalous activities: a multi-instance learning
graph-based model is developed with abnormal and normal bimodal data, the positive instances are highlighted by
training a coarse filter using a kernel-SVM classifier, and an improved dictionary learning scheme, known as anchor
dictionary learning, is generated. Abnormality is then measured by the sparse reconstruction cost (SRC); compared with
other techniques, the method utilizes abnormal information and reduces the time and cost of SRC.

Hu et al. [15] contributed to detecting various objects in traffic scenes with a method that works in three steps: it
first detects the objects, then recognizes them, and finally tracks them in motion, mainly targeting three classes of
objects: cars, cyclists, and traffic signs. All the objects are detected using a single learning-based detection
framework consisting of a dense feature extractor and trimodal class detection. The dense features are extracted and
shared with the rest of the detectors, which leads to faster speed in the testing phase. In addition, intraclass
variation of objects is handled by object subcategorization, with competitive performance on several datasets.

Grega et al. presented an algorithm that automatically detects knives and firearms in CCTV images and alerts the
security guard or operator [16]. The work focuses on limiting false alarms and providing a real-time application; the
specificity of the algorithm is 94.93% and its sensitivity is 81.18% for knife detection. For firearm detection, the
specificity is 96.69% and the sensitivity is 35.98% for video containing different objects. Mousavi et al. [17]
presented a video classifier based on a descriptor referred to as the Histogram of Oriented Tracklets, which
identifies irregular conditions in complex scenes. In contrast to traditional approaches using optical flow, which
only measure edge features from two subsequent frames, the descriptors are built over long-range motion projections
called tracklets. Statistics are gathered on the tracklets that pass through spatiotemporal cuboids of the footage.

Ji et al. developed a system for security footage that automatically identifies human behavior using convolutional
neural networks (CNNs), forming a deep learning model that operates directly on the raw inputs [18].
Mathematical Problems in Engineering 3
To increase efficiency, the 3D CNN classification model regularizes its outputs with high-level characteristics and
integrates the observations of a variety of models.

Pang et al. presented real-time detection of concealed objects under human clothing in [19]. Passive millimeter wave
(PMMW) imagery of metallic guns on the human body was used with a YOLO algorithm on a small-scale dataset. A
comparison was then made between the Single Shot MultiBox Detector (SSD) algorithm, YOLOv3-13, SSD-VGG16, and
YOLOv3-53 on the PMMW dataset. The weapon detection achieved a detection speed of 36 frames per second and 95% mean
average precision. Warsi et al. contributed to automatically detecting handguns in visual surveillance by implementing
the YOLO V3 algorithm and comparing it with Faster Region-Based CNN (Faster RCNN) in terms of the number of false
negatives and false positives [20]. They took real-time images, combined them with the ImageNet dataset, and trained
on the result using the YOLO V3 algorithm. They compared Faster RCNN with YOLO V3 on four different videos; YOLO V3
provided faster speed in a real-time environment.

3. Methodology

In this work, we have attempted to develop an integrated surveillance security framework that detects weapons in real
time; if a detection is positive, it alerts the security personnel so that they can handle the situation by reaching
the place of the incident identified through IP cameras. We propose a model that provides a sense of vision to a
machine to identify an unsafe weapon and that can also alert the human administrator when a gun or firearm is visible
in the frame. Moreover, we have programmed a door-locking mechanism that engages when a person appears to be carrying
a dangerous weapon. Where possible, the live picture from the IP webcams can also be shared with nearby security
personnel so that they can act in the meantime. We have also built an information system, backed by a database, that
records all activities so that prompt action can be taken in a future emergency. Figure 1 presents the overall
generalized approach of our research work, divided into three parts.

The most important and crucial part of any application is to have a suitable dataset with which to train the machine
learning models. Therefore, we manually collected a large number of images from Google. A few of the image samples are
shown in Figure 2. For each weapon class, we collected at least 50 images. The google-images-download tool is a
convenient way to collect images for constructing one's own dataset. We saved those images to a folder called
"images." Images must be saved in ".jpg" form; images with other extensions are troublesome and generate errors when
provided for training. In addition, since the images are processed in batches, prior to training the sizes of all the
images are transformed into the same width and height, 416 × 416 pixels.

Object detection is a computer vision task that involves distinguishing objects in digital images. It is a domain
that has benefited immensely from recent advancements in deep learning. YOLO is basically a pretrained object
detector. It is a CNN model. A CNN is a deep learning algorithm that takes in a raw input image and assigns learnable
weights and biases to various aspects/objects in the image. A convolutional layer in a CNN model is responsible for
extracting features, such as edges, from the input image. It works by applying a k × k filter, known as a kernel,
repeatedly over the raw image. This results in activation maps, or feature maps, which indicate where the detected
features are present in the given input. The preprocessing required is therefore much lower than in other
classification algorithms: in the standard approach filters are hand-engineered, whereas in a CNN they are learned
over a number of training iterations. Figure 3 indicates a basic CNN architecture as a classification model for 10
different weapons. The next layer is a max-pooling, or subsampling, layer, which is responsible for reducing the
spatial size of the convolved features. This dimensionality reduction decreases the computational power required to
process the data. ReLU is the rectified linear unit activation expressed in (1), a nonsaturating activation function.
It eliminates undesirable (negative) values from an activation map by setting them to zero. Finally, the last layers
are fully connected layers. The output of the preceding layer is flattened into a 1-dimensional array, that is, one
long feature vector, which is fed to a feedforward neural network; backpropagation is applied in every iteration of
training. These layers learn nonlinear combinations of the high-level features represented by the output of the
convolutional layer.

ReLU: f(x) = max(0, x). (1)

As mentioned earlier, YOLO is a pretrained object detector; a pretrained model is one that has already been trained
on another dataset. It is extremely time consuming to train a model from scratch; it can take weeks or a month to
complete the training step. A pretrained model has already seen a very large number of objects and knows how each of
them must be classified. The weights in the abovementioned pretrained model have been obtained by training the network
on the COCO and ImageNet datasets. Thus, it can only detect objects belonging to the classes present in the dataset
used to train the network. It uses Darknet-53 as the backbone network for feature extraction and makes predictions at
three scales. Darknet-53 is itself a convolutional neural network with 53 layers, as elucidated in Figure 4. It is a
fully convolutional neural network: pooling layers are replaced with convolution operations with stride 2.
Furthermore, residual units are applied to avoid vanishing gradients.

Initially, CNN architectures were quite linear. Recently, numerous variations have been introduced, for example, middle
Figure 1: Overall approach of the proposed system (diagram blocks: frame conversion, image preprocessing, training
data, object detection, object identification, classification, get location, alert, database).
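The image preprocessing step of the overall approach (Figure 1) brings every image to the fixed 416 × 416 input size
stated in Section 3. The sketch below shows one way such a resize could work, using plain nested lists and
nearest-neighbour sampling so that it runs without dependencies. The function name `resize_nearest` and the choice of
nearest-neighbour interpolation are assumptions of this sketch; the paper does not say how the resizing was done, and
a real pipeline would more likely rely on an image library.

```python
def resize_nearest(image, out_w=416, out_h=416):
    """Resize a row-major image (a list of rows of pixels) by nearest-neighbour sampling.

    Hypothetical helper for illustration; not taken from the paper.
    """
    in_h = len(image)
    in_w = len(image[0])
    resized = []
    for y in range(out_h):
        # Map the output row back to the closest source row.
        src_y = min(in_h - 1, y * in_h // out_h)
        row = image[src_y]
        # Map every output column back to the closest source column.
        resized.append([row[min(in_w - 1, x * in_w // out_w)] for x in range(out_w)])
    return resized


# A tiny 2 x 3 "image" stands in for a decoded video frame.
frame = [[0, 1, 2],
         [3, 4, 5]]
square = resize_nearest(frame)
print(len(square), len(square[0]))  # 416 416
```

Resizing directly to a square changes the aspect ratio of non-square frames; whether the original training pipeline
padded the images instead is not stated in the paper.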
blocks, skip connections, and aggregations of data between layers. These network models have already acquired rich
feature representations by being trained over a wide range of images. Selecting a pretrained network and using it as
a starting point to learn a new task is the concept behind transfer learning. In order to recognize the weapons, we
took the weights of a pretrained model and trained another YOLO V3 model.

YOLO V3 is designed to be a multiscale detector rather than an image classifier. Therefore, for object detection, the
classification head is replaced by a detection head appended to this architecture. The output is then a vector with
the bounding box coordinates and class probabilities. YOLO V3 uses Darknet-53 as its backbone, a network with 53
layers, as indicated in Figure 4. Moreover, for the object detection task an additional 53 layers are stacked on top
of it, giving a 106-layer fully convolutional architecture in total. Owing to its multiscale feature fusion layers,
YOLO V3 uses 3 feature maps of different scales for target detection, as shown in Figure 5.

4. Experimental Results

Image classification assigns a class label to an object in a picture. Object localization, in contrast, identifies
the location of at least one object in a picture and draws a bounding box around its extent, as shown in Figure 6.
Moreover, Figure 7 illustrates the detection of a rifle in an animated video. The shape of the detection kernel is
computed as 1 × 1 × (bb × (4 + 1 + nc)). Here, bb is the number of bounding boxes, "4" stands for the 4 bounding box
coordinate positions, "1" is the object confidence, and nc is the number of
Figure 3: Feedforward convolutional neural network (CNN). Diagram stages: input (28 × 28 × 1); Conv_1, convolution
with (5 × 5) kernel and valid padding, giving n1 channels (24 × 24 × n1); max-pooling (2 × 2), giving n1 channels
(12 × 12 × n1); Conv_2, convolution with (5 × 5) kernel and valid padding, giving n2 channels (8 × 8 × n2);
max-pooling (2 × 2), giving n2 channels (4 × 4 × n2); flattened; fc_3 and fc_4, fully connected neural networks (ReLU
activation, with dropout, n3 units); output (classes 0 to 9).
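The feature-map sizes annotated in Figure 3 follow from two simple rules: a convolution with valid padding shrinks
each side from n to n − k + 1 for a k × k kernel, and a 2 × 2 max-pooling layer halves each side. The short sketch
below traces these sizes and also evaluates the ReLU of equation (1). It assumes a convolution stride of 1 and
non-overlapping pooling windows, which the figure does not state explicitly; the function names are illustrative only.

```python
def conv_valid(size, kernel, stride=1):
    """Output side length of a convolution with valid padding (stride 1 assumed by default)."""
    return (size - kernel) // stride + 1


def max_pool(size, window=2):
    """Output side length of non-overlapping max-pooling."""
    return size // window


def relu(x):
    """Rectified linear unit of equation (1): f(x) = max(0, x)."""
    return max(0.0, x)


size = 28                      # input side length from Figure 3
trace = [size]
for step in ("conv", "pool", "conv", "pool"):
    size = conv_valid(size, 5) if step == "conv" else max_pool(size, 2)
    trace.append(size)

print(trace)                   # [28, 24, 12, 8, 4]
print(relu(-3.0), relu(2.5))   # 0.0 2.5
```

The final side length of 4 means the flattened vector fed to the fully connected layers has 4 × 4 × n2 entries.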
classes. The downsampling of the input image for the three prediction scales is given by strides 32, 16, and 8. The
loss function comprises three parts: location error (L_box), confidence error (L_obj), and classification error
(L_cls), as presented in (2).

Loss = L_box + L_cls + L_obj. (2)

Literature suggests that YOLO v2 often struggled with small object detections. This happened because fine-grained
features were lost as the layers downsampled the input. To counter this, YOLO v2 applies an identity mapping,
concatenating feature maps from a previous layer to capture low-level features. However, YOLO v2's architecture still
lacked some of the influential elements found in most state-of-the-art algorithms: the early models had no residual
blocks, skip connections, or upsampling. YOLO v3, on the other hand, incorporates all of these. The detection of
smaller objects can be seen in the cumulative results demonstrated in Figure 8. We retrained both YOLO V2 and YOLO V3.
We also conducted a comparative analysis of these models against a traditional CNN trained from scratch, without
pretrained weights. The obtained results are summarized in Table 1.

      Type            Filters   Size        Output
      Convolutional   32        3 × 3       256 × 256
      Convolutional   64        3 × 3 / 2   128 × 128
 1×   Convolutional   32        1 × 1
      Convolutional   64        3 × 3
      Residual                              128 × 128
      Convolutional   128       3 × 3 / 2   64 × 64
 2×   Convolutional   64        1 × 1
      Convolutional   128       3 × 3
      Residual                              64 × 64
      Convolutional   256       3 × 3 / 2   32 × 32
 8×   Convolutional   128       1 × 1
      Convolutional   256       3 × 3
      Residual                              32 × 32
      Convolutional   512       3 × 3 / 2   16 × 16
 8×   Convolutional   256       1 × 1
      Convolutional   512       3 × 3
      Residual                              16 × 16
      Convolutional   1024      3 × 3 / 2   8 × 8
 4×   Convolutional   512       1 × 1
      Convolutional   1024      3 × 3
      Residual                              8 × 8
      Avgpool                   Global
      Connected                 1000
      Softmax

Figure 4: Architectural details of DARKNET-53 layers [10]. A multiplier such as 8× marks a block of three rows
(1 × 1 convolution, 3 × 3 convolution, residual) that is repeated that many times.

The subsequent part of our research concerns recording the location where the weapon was detected, so that the alarm
can be generated. For this purpose, we created a database at the backend. A desktop application was also developed to
provide connectivity with the database system. Four attributes are collected from the site where an object such as a
weapon was detected. The collected information needs to be translated into a geographical format of longitude and
latitude. For this purpose, geocoding was performed: the method of translating addresses into geographical details,
longitude and latitude, to map their positions. As can be seen from the relational table provided in Figure 9, the
attributes are latitude, longitude, time, and the location where weapons were seen or identified. At the backend, a
DAO (Data Access Object) layer is also available to show the user the data from the database. Swing is a component of
the Java Foundation Classes (JFC), which is a
GUI-providing API for Java programs. Swing provides packages that let us build a complex collection of GUI components
for our Java programs, and it is platform independent. Figure 10 presents the class diagram and implementation of the
DAO layer.

Our proposed system is further compared with the existing literature in Table 2. In [21], the proposed system
includes a CNN-based VGG-16 architecture as feature extractor, followed by state-of-the-art classifiers, implemented
on a standard gun database. The researchers
Figure 7: Real-time weapon detected from a video surrounded by bounding box. Weapon category rifle.
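Each bounding box such as the one in Figure 7 is read from a detection tensor whose depth follows the formula given
in Section 4, 1 × 1 × (bb × (4 + 1 + nc)), evaluated on grids obtained by downsampling the input with strides 32, 16,
and 8. The sketch below works through that arithmetic for the 416 × 416 input used in this work. The values bb = 3
boxes per grid cell and nc = 2 classes are assumptions chosen for illustration; the paper does not state either
number, and the function names are hypothetical.

```python
def detection_depth(bb, nc):
    """Depth of the 1 x 1 detection kernel: bb * (4 box coordinates + 1 objectness + nc classes)."""
    return bb * (4 + 1 + nc)


def grid_sizes(input_size=416, strides=(32, 16, 8)):
    """Side length of the prediction grid at each of the three scales."""
    return [input_size // s for s in strides]


BOXES_PER_CELL = 3   # assumption: not stated in the paper
NUM_CLASSES = 2      # hypothetical class count, e.g. handgun and rifle

depth = detection_depth(BOXES_PER_CELL, NUM_CLASSES)
grids = grid_sizes()
total_boxes = sum(g * g * BOXES_PER_CELL for g in grids)

for g in grids:
    print(f"{g} x {g} x {depth}")   # 13 x 13 x 21, 26 x 26 x 21, 52 x 52 x 21
print(total_boxes)                  # 10647
```

The finest grid (stride 8) contributes most of the candidate boxes, which is consistent with the paper's remark that
multiscale prediction helps with smaller objects.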
Figure 10 (class diagram of the DAO layer), recoverable elements:
<<Interface>> DAO: +getLocation()
DAOImp: +getLocation()
Connectivity: –Connection con; +connect()
Database
GunDetection: +setRecord()
OfficialPanel: +getLoaction() (sic)
Location: –float lat; –float lon; –String loc; getters and setters for lat, lon, and loc
Relationship labels in the diagram: "Implemented by" (twice) and "Used by" (twice).
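To make the role of the DAO layer concrete, the sketch below stores and retrieves the four attributes named in the
text (latitude, longitude, time, and location) through a small data access object. It is a Python stand-in using the
standard `sqlite3` module and an in-memory database, not the authors' Java implementation; the class name
`DetectionDao`, the table name `detections`, the column names, and the sample coordinates are all hypothetical, and
only the method names loosely mirror `setRecord()` and `getLocation()` from the class diagram.

```python
import sqlite3
from collections import namedtuple

# Mirrors the Location class of the diagram: latitude, longitude, and a place name.
Location = namedtuple("Location", ["lat", "lon", "loc"])


class DetectionDao:
    """Hypothetical data access object for weapon-detection records."""

    def __init__(self, path=":memory:"):
        self.con = sqlite3.connect(path)
        self.con.execute(
            "CREATE TABLE IF NOT EXISTS detections ("
            "lat REAL, lon REAL, seen_at TEXT, loc TEXT)"
        )

    def set_record(self, location, seen_at):
        """Store one detection with its position, place name, and timestamp."""
        self.con.execute(
            "INSERT INTO detections (lat, lon, seen_at, loc) VALUES (?, ?, ?, ?)",
            (location.lat, location.lon, seen_at, location.loc),
        )
        self.con.commit()

    def get_location(self):
        """Return all stored detections, oldest first, as (Location, timestamp) pairs."""
        rows = self.con.execute(
            "SELECT lat, lon, seen_at, loc FROM detections ORDER BY seen_at"
        ).fetchall()
        return [(Location(lat, lon, loc), seen_at) for lat, lon, seen_at, loc in rows]


dao = DetectionDao()
# Placeholder values for illustration only.
dao.set_record(Location(25.40, 68.26, "Main gate"), "2021-03-04T10:15:00")
records = dao.get_location()
print(records)
```

Keeping the SQL inside one class means the detector and the operator panel only call `set_record` and
`get_location`, which is the separation the DAO pattern is meant to provide.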
investigated four machine learning models, namely, BoW, HOG + SVM, CNN, and Alexnet + SVM, to recognize firearms and
knives in a dataset of images [22]. Their work suggests that the pretrained Alexnet + SVM performed the best. As is
evident from the previous studies, researchers have widely applied CNNs and their variants for weapon or knife
identification in CCTV videos [23]. Table 2 shows that the implemented YOLO v3 outperforms the other models.

5. Conclusion and Future Work

In this study, the state-of-the-art YOLO V3 object detection model was implemented and trained on our collected
dataset for weapon detection. We propose a model that provides a sense of vision to a machine or robot to identify an
unsafe weapon and that can also alert the human administrator when a gun or a firearm is visible in the frame. The
experimental results show that the trained YOLO V3 performs better than the YOLO V2 model and is computationally less
expensive. There is an immediate need to update the current surveillance capabilities with improved resources to
support monitoring the effectiveness of human operators. Smart surveillance systems would fully replace current
infrastructure with the growing availability of low-cost storage, video infrastructure, and better video processing
technologies. Eventually, digital monitoring systems in the form of robots would fully replace current surveillance
systems with the growing availability of cheap computing, video infrastructure, high-end technology, and better video
processing.

Data Availability

The data are available on request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] S. A. Velastin, B. A. Boghossian, and M. A. Vicencio-Silva, “A motion-based image processing system for detecting
potentially dangerous situations in underground railway stations,” Transportation Research Part C: Emerging
Technologies, vol. 14, no. 2, pp. 96–113, 2006.
[2] United Nations, Office on Drugs and Crime, Report on “Global Study of Homicide”,
https://www.unodc.org/documents/data-and-analysis/gsh/Booklet1.pdf.
[3] P. M. Kumar, U. Gandhi, R. Varatharajan, G. Manogaran, R. Jidhesh, and T. Vadivel, “Intelligent face recognition
and navigation system using neural learning for smart security in internet of things,” Cluster Computing, vol. 22,
no. S4, pp. 7733–7744, 2019.
[4] V. Babanne, N. S. Mahajan, R. L. Sharma, and P. P. Gargate, “Machine learning based smart surveillance system,”
in Proceedings of the 2019 Third International Conference on
I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), pp. 84–86, IEEE, Palladam, India, December 2019.
[5] A. Joshi, N. Jagdale, R. Gandhi, and S. Chaudhari, “Smart surveillance system for detection of suspicious
behaviour using machine learning,” in Intelligent Computing, Information and Control Systems. ICICCS 2019. Advances
in Intelligent Systems and Computing, A. Pandian, K. Ntalianis, and R. Palanisamy, Eds., vol. 1039, Berlin, Germany,
Springer, Cham, 2020.
[6] K.-E. Ko and K.-B. Sim, “Deep convolutional framework for abnormal behavior detection in a smart surveillance
system,” Engineering Applications of Artificial Intelligence, vol. 67, pp. 226–234, 2018.
[7] S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B.-Y. Choi, and T. Faughnan, “Smart surveillance as an edge network
service: from harr-cascade, SVM to a lightweight CNN,” in Proceedings of the 2018 IEEE 4th International Conference
on Collaboration and Internet Computing (CIC), pp. 256–265, Philadelphia, PA, USA, April 2018.
[8] R. Xu, S. Y. Nikouei, Y. Chen et al., “Real-time human objects tracking for smart surveillance at the edge,” in
Proceedings of the 2018 IEEE International Conference on Communications (ICC), pp. 1–6, Kansas City, MO, USA, May
2018.
[9] S. Ahmed, A. Ahmed, I. Mansoor, F. Junejo, and A. Saeed, “Output feedback adaptive fractional-order
super-twisting sliding mode control of robotic manipulator,” Iranian Journal of Science and Technology, Transactions
of Electrical Engineering, vol. 45, no. 1, pp. 335–347, 2021.
[10] S. Ahmed, H. Wang, and Y. Tian, “Adaptive fractional high-order terminal sliding mode control for nonlinear
robotic manipulator under alternating loads,” Asian Journal of Control, 2020.
[11] S. Ahmed, H. Wang, and Y. Tian, “Adaptive high-order terminal sliding mode control based on time delay
estimation for the robotic manipulators with backlash hysteresis,” IEEE Transactions on Systems, Man, and
Cybernetics: Systems, vol. 51, no. 2, pp. 1128–1137, 2021.
[12] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: unified, real-time object detection,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Las Vegas, NV, USA,
June 2016.
[13] J. Redmon and A. Farhadi, “YOLOv3: an incremental improvement,” Computer Vision and Pattern Recognition, 2018.
[14] C. He, J. Shao, and J. Sun, “An anomaly-introduced learning method for abnormal event detection,” Multimedia
Tools and Applications, vol. 77, no. 22, pp. 29573–29588, 2018.
[15] Q. Hu, S. Paisitkriangkrai, C. Shen, A. van den Hengel, and F. Porikli, “Fast detection of multiple objects in
traffic scenes with a common detection framework,” IEEE Transactions on Intelligent Transportation Systems, vol. 17,
no. 4, pp. 1002–1014, 2015.
[16] M. Grega, A. Matiolański, P. Guzik, and M. Leszczuk, “Automated detection of firearms and knives in a CCTV
image,” Sensors, vol. 16, no. 1, p. 47, 2016.
[17] H. Mousavi, S. Mohammadi, A. Perina, R. Chellali, and V. Murino, “Analyzing tracklets for the detection of
abnormal crowd behavior,” in Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision,
pp. 148–155, IEEE, Waikoloa, HI, USA, January 2015.
[18] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2012.
[19] L. Pang, H. Liu, Y. Chen, and J. Miao, “Real-time concealed object detection from passive millimeter wave images
based on the YOLOv3 algorithm,” Sensors, vol. 20, no. 6, p. 1678, 2020.
[20] A. Warsi, M. Abdullah, M. N. Husen, M. Yahya, S. Khan, and N. Jawaid, “Gun detection system using YOLOv3,” in
Proceedings of the 2019 IEEE International Conference on Smart Instrumentation, Measurement and Application (ICSIMA),
pp. 1–4, IEEE, Kuala Lumpur, Malaysia, August 2019.
[21] G. K. Verma and A. Dhillon, “A handheld gun detection using faster r-cnn deep learning,” in Proceedings of the
7th International Conference on Computer and Communication Technology, pp. 84–88, Kurukshetra, Haryana, November
2017.
[22] S. B. Kibria and M. S. Hasan, “An analysis of feature extraction and classification algorithms for dangerous
object detection,” in Proceedings of the 2017 2nd International Conference on Electrical & Electronic Engineering
(ICEEE), pp. 1–4, IEEE, Rajshahi, Bangladesh, December 2017.
[23] A. Castillo, S. Tabik, F. Pérez, R. Olmos, and F. Herrera, “Brightness guided preprocessing for automatic cold
steel weapon detection in surveillance videos with deep learning,” Neurocomputing, vol. 330, pp. 151–161, 2019.
[24] V. Gun, “Database,” http://kt.agh.edu.pl/grega/guns/.