
International Journal of Electrical and Computer Engineering (IJECE)

Vol. 99, No. 1, Month 2099, pp. 1∼1x


ISSN: 2088-8708, DOI: 10.11591/ijece.v99i1.pp1-1x

Detecting anomalies in security cameras with 3D convolutional neural network and Convolutional Long Short-term Memory
Esraa A. Mahareek1, Eman K. El-Sayed2, Nahed M. El-Desouky3, Kamal A. El-Dahshan4
1,2,3 Mathematics Department, Faculty of Science, Al-Azhar University (Girls Branch), Cairo, Egypt
2 School of Computer Science, Canadian International College (CIC), Cairo, Egypt
4 Mathematics Department, Faculty of Science, Al-Azhar University, Cairo, Egypt

Article Info

Article history:
Received month dd, yyyy
Revised month dd, yyyy
Accepted month dd, yyyy

Keywords:
Anomaly detection
3D Convolutional neural network
Surveillance videos
Bidirectional ConvLSTM
Fight detection
Violence detection

ABSTRACT

This paper presents a new method for anomaly detection in surveillance videos using deep learning. The proposed method is based on a deep network trained to identify objects and human activities in videos, and it combines the advantages of the 3D Convolutional neural network (3DCNN) and the Convolutional Long Short-term Memory (ConvLSTM). The 3DCNN is used to extract spatio-temporal features from the video frames, while the ConvLSTM is used to capture temporal dependencies between frames. The method was tested on five real-world large-scale datasets (UCF-Crime, XD-Violence, UBI-Fights, CCTV-Fights, and UCF-101) containing indoor and outdoor video sequences, as well as on synthetic datasets with different object sizes, appearances, and activity types. The results show that combining 3DCNN and ConvLSTM improves accuracy and reduces false positives, achieving high accuracy and Area under the ROC Curve (AUC) in both indoor and outdoor scenarios compared to the state-of-the-art methods reported in the comparison.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Esraa A. Mahareek
Department of Mathematics, Faculty of Science, Al-Azhar University, Cairo, Egypt
Email: [email protected]

1. INTRODUCTION
The ability to monitor public spaces and respond quickly to incidents is the main reason video surveillance systems are deployed, and maintaining that ability is a considerable challenge. The use of surveillance systems has increased, but human monitoring capacity has not kept up [1]. As a result, it takes a great deal of supervision to spot unusual events that could endanger a person or a business, and given how rarely anomalous events occur compared to regular ones, this entails a significant loss of labour and time [2].
Surveillance video is an important source of data for law enforcement, security, and other organizations. A surveillance system is an automated system used for monitoring indoor or outdoor environments such as airports, malls, and parking lots. The recorded video streams are converted into images using 2D (2-dimensional) or 3D (3-dimensional) cameras. These images are then analysed by computer vision algorithms to detect objects, people, and actions in the scene. Detecting unusual events in these scenes is an important task in video surveillance systems, as it enables detection of and response to unexpected events such as robberies, assaults, vandalism, or traffic collisions.


Because anomalous events are rare compared to normal ones, monitoring surveillance videos, while very important, is very time consuming, so developing a computer vision system that automatically detects anomalous actions in surveillance videos is necessary. The low resolution and discontinuous nature of many surveillance videos can also make it difficult to detect changes in the scene. Traditional approaches to this problem rely on hand-crafted feature extractors to identify anomalous events. These approaches are time-consuming and difficult to maintain as the video format evolves over time. Recent advances in machine learning have made it possible to train algorithms to perform anomaly detection without manually identifying features.
The problem definition for detecting anomalies in surveillance videos involves developing
algorithms that can identify events or behaviors that deviate from the expected norm in a given
environment. This task can be particularly challenging due to the complexity and variability of real-
world environments, as well as the need for algorithms to operate in real-time, with minimal delay
or latency. Furthermore, the algorithms must be able to distinguish between normal and abnormal
events with a high level of accuracy to avoid false positives or negatives. To achieve this, various
approaches have been proposed such as machine learning techniques that can automatically learn from
training data and identify patterns of normal behavior. Another approach is to use deep learning
techniques, such as convolutional neural networks (CNNs), which have shown promising results in
detecting anomalies in surveillance videos. However, the effectiveness of these approaches depends
heavily on the quality and quantity of training data available, as well as the specific features that are
used to represent normal and abnormal behaviors. Additionally, the development of more advanced sensors and cameras that can capture high-quality video data with greater detail and resolution has also contributed to improving detection accuracy.
In this paper we propose a new method for automatic detection and classification of anomalies in video recordings using convolutional neural networks, which extract features from video frames and classify them into different anomaly classes such as assault, robbery, and fighting. In this approach we use a 3D Convolutional neural network (3D-CNN) to learn short-term spatio-temporal features of anomalies, followed by a Convolutional Long Short-term Memory (ConvLSTM) network to learn long-term temporal features, and then combine these networks in a unified architecture to classify surveillance videos and improve training stability and performance. Several convolutional layers are trained with millions of images in order to learn image features that are discriminative for the different anomaly classes, and for each class they are trained to separate the normal from the abnormal frames in a video recording. This is achieved by evaluating the similarity between a feature vector extracted from a normal frame and a feature vector extracted from an anomaly frame belonging to the same class, and then classifying the frame as either normal or abnormal based on the resulting similarity score. The main disadvantage of this approach is that it requires a large number of training images and very large datasets for the network to learn useful image features. We therefore trained our model on a large dataset, UCF-Crime, which contains more than 128 hours of recorded video divided into 8 anomaly classes and 1 normal class. We evaluated the model's performance on held-out test data; the results show that it achieves good classification accuracy for different types of anomaly events and outperforms other recent approaches.
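As a minimal illustration of this similarity-based scoring, the sketch below compares a test-frame feature vector against a normal-frame feature vector. Cosine similarity and the threshold value are illustrative assumptions; the paper does not fix a particular similarity metric or threshold here.

```python
# A minimal sketch of similarity-based frame classification, assuming feature
# vectors are taken from the network's penultimate layer. Cosine similarity
# and the 0.5 threshold are illustrative assumptions, not the paper's choices.
import numpy as np

def similarity_score(normal_feat: np.ndarray, frame_feat: np.ndarray) -> float:
    """Cosine similarity between a normal-frame feature and a test-frame feature."""
    num = float(np.dot(normal_feat, frame_feat))
    den = float(np.linalg.norm(normal_feat) * np.linalg.norm(frame_feat)) + 1e-8
    return num / den

def classify_frame(normal_feat, frame_feat, threshold=0.5):
    # Frames that look unlike the normal prototype are flagged as abnormal.
    return "normal" if similarity_score(normal_feat, frame_feat) >= threshold else "abnormal"
```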
Firstly, we describe the datasets used in this paper, how they were pre-processed, and how the model was trained and tested to detect different types of anomalies. We then report the results obtained on the test data, showing classification accuracy and Area under the ROC Curve (AUC) for each dataset. The remainder of this paper is organized as follows: Section 2 reviews the literature related to this research. Sections 3 and 4 describe the 3D-CNN and the ConvLSTM, respectively. Section 5 presents the proposed technique, Section 6 describes the datasets, Section 7 briefly covers how the training data is pre-processed and the implementation, and Section 8 reports the experimental results, followed by the conclusions.

2. LITERATURE REVIEW
In the field of action detection, using computer vision to identify certain actions in security camera footage has grown in popularity, and many researchers have been trying to develop efficient machine-learning methods for the automatic video anomaly detection task. Fig. 1 shows the distribution of papers on anomaly detection in the publicly available literature between 2015 and 2021 [3], together with some related keywords. A model-based technique for anomaly identification in surveillance footage is proposed by Kamoona et al. [4] (2019). The system is divided into two phases. In the first phase, numerous handcrafted features are computed; in addition, Convolutional 3D (C3D) features have been extracted from video data using deep learning approaches, with anomaly detection performed by a support vector machine (SVM), techniques applied by Sultani, Chen, and Shah (2018) [5]. Behaviour modelling is the following stage: in this phase, an SVM is trained using a Bag of Visual Words (BOVW) in order to learn the representation of usual behaviour.
Campus violence is the most dangerous kind of school bullying and is a global societal problem.
As AI and remote monitoring capabilities develop, there are several potential methods to detect
campus violence, including video-based ones. Ye et al. (2021) [6] use audio and visual data to detect
campus violence. Data on campus violence is gathered through role-playing, and every 16 frames
of video are used to extract 4096-dimension feature vectors. The 3D CNN is employed for feature
extraction and classification, and an overall precision of 92.00 percent is achieved.

Figure 1. Distribution of papers on violence detection per year

The Trajectory-Pooled Deep Convolutional Networks ConvNet model, which has 17 convolution-pool-norm layers and two fully connected layers, was employed by Meng et al. (2020) [6]. They apply their algorithm to both crowded and uncrowded datasets, achieving 92.5% accuracy on the Crowd Violence dataset and 98.6% on the Hockey Fight dataset. A new method for evaluating whether a video contains violent scenes is presented by Rendón-Segador et al. (2021) [7]. It is based on a modified 3D DenseNet with a multi-head self-attention layer and a bidirectional ConvLSTM module.
A weakly supervised anomaly localization (WSAL) technique is proposed by Hui Lv et al. [7]; it focuses on temporally localising anomalous portions within anomalous videos. Inspired by the visual contrast in such videos, the evolution of neighbouring temporal segments is assessed in order to locate anomalous segments. To do this, a high-order context encoding model is suggested that not only extracts semantic representations but also measures dynamic variations, making efficient use of the temporal context.
Due to the difficulty of accurately capturing both the spatial and temporal information of successive video frames, video classification is more complex than classification of static images. The 3D convolution operator was suggested by S. Ji et al. [8] for computing features from both spatial and temporal data. By examining the synergy between dictionary-based representation and self-supervised learning, Wu et al. [9] offer a self-supervised sparse representation (S3R) framework in 2022 that models the concept of anomaly at the feature level. The Magnitude Contrastive Loss and the Feature Amplification Mechanism are proposed by Chen et al. in 2022 [10] to improve the discriminativeness of feature magnitudes for identifying anomalies; experiments are reported on the UCF-Crime and XD-Violence benchmark datasets.

3. 3D CONVOLUTIONAL NEURAL NETWORK


A 3D CNN is a type of neural network composed of several convolutional layers followed by several layers of nonlinear units (the "fully connected" layers), with the convolution kernels arranged across several parallel planes (i.e., three-dimensionally). A convolution can be applied along the time dimension to extract temporal patterns in the data, just as convolutional layers extract spatial patterns in image data. However, if the data contains both spatial and temporal patterns, as video data does, these two types of patterns should be studied together, since they can combine to create more complicated spatio-temporal patterns. The basic idea behind a 3D CNN is therefore to process the video sequence along the spatial and temporal dimensions together in order to obtain the final result.
3DCNN extends the CNN by enlarging the convolution kernel, and it is effective for extracting video features [11]. For a more thorough analysis, 3DCNN extracts the spatio-temporal features from the entire video. The 3D convolution kernel is used to extract regional spatio-temporal neighbourhood information, which suits the video data format. The 3D convolution is formulated in Eq. (1):

v_{ij}^{xyz} = ReLU(b_{ij} + ∑_m ∑_{p=0}^{P_i−1} ∑_{q=0}^{Q_i−1} ∑_{r=0}^{R_i−1} w_{ijm}^{pqr} k_{(i−1)m}^{(x+p)(y+q)(z+r)})   (1)

where ReLU is the hidden layer's activation function, v_{ij}^{xyz} is the value at position (x, y, z) of the j-th feature map in the i-th layer, and b_{ij} is the bias of the j-th feature map in the i-th layer. w_{ijm}^{pqr} is the (p, q, r)-th value of the kernel connected to the m-th feature map in the preceding layer. P_i and Q_i stand for the height and width of the convolution kernel, respectively, and R_i is the size of the 3D kernel along the temporal dimension.
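To make Eq. (1) concrete, the following sketch applies a single 3D convolutional layer with a ReLU activation to a random clip; the filter count and tensor shapes are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of Eq. (1): a 3D convolution slides one kernel over height,
# width, and time at once, producing spatio-temporal feature maps.
import numpy as np
from tensorflow.keras import layers

clip = np.random.rand(1, 16, 32, 32, 3).astype("float32")  # (batch, T, H, W, C)
conv3d = layers.Conv3D(filters=32, kernel_size=(3, 3, 3),
                       padding="same", activation="relu")
features = conv3d(clip)
print(features.shape)  # (1, 16, 32, 32, 32): one value v_ij^xyz per position
```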

4. CONVOLUTIONAL LSTM NEURAL NETWORK (CONVLSTM)


The ConvLSTM was created especially for spatio-temporal sequence prediction problems. ConvLSTM can extract spatial and temporal features from feature map sets more effectively than a standard LSTM [11], because ConvLSTM, which analyses and forecasts events in time series, also takes the spatial structure of each feature map into account. Therefore, ConvLSTM can be used to resolve timing issues more effectively in dynamic anomaly recognition. The following equations formulate the ConvLSTM [12]:

i_t = σ(W_{xi} ∗ X_t + W_{hi} ∗ H_{t−1} + W_{ci} ◦ C_{t−1} + y_i)   (2)

f_t = σ(W_{xf} ∗ X_t + W_{hf} ∗ H_{t−1} + W_{cf} ◦ C_{t−1} + y_f)   (3)

C_t = f_t ◦ C_{t−1} + i_t ◦ tanh(W_{hc} ∗ H_{t−1} + W_{xc} ∗ X_t + y_c)   (4)

O_t = σ(W_{xo} ∗ X_t + W_{ho} ∗ H_{t−1} + W_{co} ◦ C_t + y_o)   (5)

H_t = O_t ◦ tanh(C_t)   (6)

The inputs are X_1, X_2, ..., X_t, the cell states are C_1, C_2, ..., C_t, and the hidden states are H_1, H_2, ..., H_t. The gates i_t, f_t, and O_t are three-dimensional tensors whose last two dimensions (rows and columns) are spatial. The operators "∗" and "◦" denote the convolution operator and the Hadamard product, respectively. Batch normalisation and dropout layers are added to the ConvLSTM in this work.
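The sketch below shows a corresponding ConvLSTM layer in Keras, where the gate computations of Eqs. (2)-(6) replace the matrix products of a standard LSTM with convolutions; the shapes are illustrative assumptions.

```python
# A minimal ConvLSTM sketch: Keras' ConvLSTM2D implements gates of the form of
# Eqs. (2)-(6), convolving the inputs X_t and hidden states H_(t-1).
import numpy as np
from tensorflow.keras import layers

clips = np.random.rand(2, 16, 32, 32, 3).astype("float32")  # (batch, T, H, W, C)
convlstm = layers.ConvLSTM2D(filters=64, kernel_size=(3, 3),
                             padding="same", return_sequences=False)
hidden = convlstm(clips)  # final hidden state H_t of Eq. (6)
print(hidden.shape)       # (2, 32, 32, 64)
```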


5. PROPOSED METHOD
A 3D CNN and a ConvLSTM are coupled to classify video; this section outlines the 3DCNN-ConvLSTM model's architecture. We propose a 3D convolutional neural network (3DCNN) followed by a Convolutional Long Short-term Memory (ConvLSTM) network as the feature extraction model for the dynamic anomaly identification process. The 3DCNN-ConvLSTM model's architecture is shown in Fig. 2. The input layer takes a stack of consecutive anomaly video frames resized to 16×32×32×3. The architecture comprises four 3D convolutional layers with different numbers of filters (32, 32, 64, and 64) but the same 3×3×3 kernel size, followed by a ConvLSTM layer with 64 units. A ReLU activation and a batch normalisation layer follow each 3DCNN layer, and 3D max pooling and dropout layers (with rates of 0.3 and 0.5) are placed between each pair of 3DCNN layers. A fully connected layer with 512 units feeds the output layer, which uses a Softmax activation with as many output units as there are anomaly video classes.
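A minimal Keras sketch of this architecture is given below. The filter counts (32, 32, 64, 64), 3×3×3 kernels, dropout rates of 0.3 and 0.5, the 64-unit ConvLSTM, and the 512-unit dense layer follow the description above; the pooling sizes, optimizer, and loss are assumptions, since the paper does not state them.

```python
# A sketch of the 3DCNN-ConvLSTM model under the assumptions stated above.
from tensorflow.keras import layers, models

def build_3dcnn_convlstm(n_classes, frames=16, height=32, width=32, channels=3):
    model = models.Sequential([
        layers.Input(shape=(frames, height, width, channels)),
        # First pair of 3D conv layers, each followed by ReLU and batch norm
        layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling3D(pool_size=(2, 2, 2)),  # assumed pooling size
        layers.Dropout(0.3),
        # Second pair of 3D conv layers
        layers.Conv3D(64, (3, 3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv3D(64, (3, 3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling3D(pool_size=(2, 2, 2)),
        layers.Dropout(0.5),
        # ConvLSTM layer captures long-term temporal dependencies
        layers.ConvLSTM2D(64, (3, 3), padding="same", return_sequences=False),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",  # assumed optimizer
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_3dcnn_convlstm(n_classes=9)  # e.g., 8 anomaly classes + normal
model.summary()
```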
To classify a test video, it is collected, divided into 16 consecutive frames, and fed into the trained model. The features learned by the model are used to determine a probability score for each frame. The predictions for the 16 frames are then passed to the majority voting scheme, which predicts the label of the video sequence based on the probability score of each frame. The majority voting formula is given in Eq. (7).

Y = mode(C(X_1), C(X_2), C(X_3), ..., C(X_16))   (7)

where X_1, X_2, ..., X_16 denote the frames taken from the tested video and Y is the predicted class label of the video. C(X_1), C(X_2), C(X_3), ..., C(X_16) are the predicted class labels for the individual frames.
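A minimal sketch of the majority voting scheme of Eq. (7) is shown below, assuming the per-frame class labels C(X_1), ..., C(X_16) have already been obtained from the model's probability scores.

```python
# Majority voting over per-frame predictions, as in Eq. (7).
import numpy as np

def majority_vote(frame_labels):
    """frame_labels: predicted class label C(X_t) for each of the 16 frames."""
    values, counts = np.unique(np.asarray(frame_labels), return_counts=True)
    return values[counts.argmax()]  # Y = mode(C(X_1), ..., C(X_16))

# Example: 16 per-frame predictions -> one video-level label
print(majority_vote([2, 2, 0, 2, 2, 2, 1, 2, 2, 2, 0, 2, 2, 2, 2, 2]))  # -> 2
```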

Figure 2. General design of the 3DCNN-ConvLSTM model



6. DATASETS
Finding and analysing anomalies in video data is becoming increasingly popular. To meet this need, we apply our approach to multiple significant video datasets to detect and characterize anomalies: UCF-Crime [5], XD-Violence [13], UBI-Fights [14], NTU CCTV-Fights [15], and UCF-101 [16].
The first dataset, UCF-Crime, is a large and varied collection of 128 hours of video. It consists of 1900 long, untrimmed videos covering eight categories of crimes: assault, arson, fighting, burglary, explosion, arrest, abuse, and road accidents. The collection also includes "Normal" videos, meaning those without any recorded crimes. Two tasks can be accomplished using this dataset. First, a general analysis of anomalies can be performed, considering all anomalies as one group and all regular activities as another. Figure 3 shows how the percentage of videos in the UCF-Crime collection is distributed per class.
The second dataset, XD-Violence [13], is a massive, multi-scene dataset with a duration of 217 hours, comprising 4754 untrimmed videos with audio signals and weak labels. The third dataset, UBI-Fights, contains 80 hours of footage fully annotated at the frame level; it focuses on a specific anomaly, fighting, while still offering a wide variety of fight scenarios. It consists of 1000 videos, of which 216 contain fight scenes and 784 depict ordinary daily events. To prevent disruptions to the learning process, all extraneous video segments, such as video introductions and news footage, were deleted. The titles of the videos include indicators of the kind of each video, such as indoor or outdoor environment, RGB or grayscale video, and fixed, rotating, or movable camera.
The fourth dataset, UCF-101, is a collection of real-world action videos from YouTube with 101 different action categories. It has 13320 videos across the 101 action categories. The videos are divided into 25 groups, each of which may contain 4-7 videos of an activity. Videos from the same group may share comparable backgrounds, points of view, and other characteristics.

Figure 3. The percentage distribution of videos per UCF-Crime class

The final dataset, CCTV-Fights, includes 1,000 videos of real fights captured by CCTV or portable cameras. There are 280 CCTV videos in total, with fights ranging in length from 5 seconds to 12 minutes, with an average of 2 minutes. It also includes 720 videos of real fights taken from other sources (referred to as Non-CCTV in this document), mostly from mobile cameras but occasionally from dashcams, drones, and helicopters. These videos range in length from 3 seconds to 7 minutes, with an average of 45 seconds; some of them contain several fights, which can help the model generalise better. The datasets utilized in this experiment are fully described in Table 1.

Table 1. Detailed information on each dataset utilized in the comparison

Dataset       #videos   #hours   #Violence types   Size
UCF-Crime     1900      128      9                 60 GB
XD-Violence   4754      217      6                 123 GB
CCTV-Fights   1000      17.68    1                 7.2 GB
UBI-Fights    1000      80       2                 7.9 GB
UCF-101       13320     27       101               7 GB

7. IMPLEMENTATION
We divide each dataset into 75%:25% training and testing splits for evaluation. Each split is further divided into five folds, each containing approximately one third of the total videos for training or validation, with the remaining videos used for testing. The deep learning model was implemented on a Windows 10 Pro computer with an Intel Core i7 CPU and 16 GB of RAM. The system was implemented in Python, using the Anaconda environment and the Spyder editor. Keras and TensorFlow were used as the deep learning libraries, and the Python OpenCV package was used for handling and pre-processing the data.
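The sketch below illustrates this kind of pre-processing with OpenCV: reading a video, sampling 16 frames, and resizing them to 32×32 as the model expects. Uniform frame sampling and scaling pixel values to [0, 1] are assumptions, not details stated in the paper.

```python
# A pre-processing sketch under the assumptions stated above.
import cv2
import numpy as np

def load_clip(video_path, n_frames=16, size=(32, 32)):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick n_frames indices spread uniformly across the whole video
    indices = np.linspace(0, max(total - 1, 0), n_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, size)
        frames.append(frame.astype(np.float32) / 255.0)  # scale to [0, 1]
    cap.release()
    return np.stack(frames)  # shape: (16, 32, 32, 3)
```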
There are numerous parameters in deep learning models that influence their development and effectiveness. Here we discuss how the number of iterations affects the behaviour of our network, as it is one of the most crucial hyperparameters in contemporary deep learning systems. In practice, fewer iterations are needed to train the model, which speeds up computation significantly thanks to Graphics processing unit (GPU) parallelism. On the other hand, using larger iteration numbers resulted in longer training times than using small iteration numbers, but testing accuracy was higher. The size of the training dataset largely determines the appropriate batch size and number of epochs.

8. EXPERIMENTAL RESULTS
Performance evaluation is an important task, so the AUC (Area Under the Curve) is used to assess and depict the performance of the multi-class classification problem. It is one of the most fundamental evaluation criteria for assessing the effectiveness of any classification model. AUC measures the level of separability: it reveals how well the model can differentiate between classes.
The two metrics used for classification models are accuracy and AUC. A model with high accuracy makes very few erroneous predictions, but the cost of those erroneous predictions is not considered. Accuracy measurements abstract away the true positive (TP) and false positive (FP) details and can give model forecasts an excessive amount of confidence, which is harmful when the two error types have different costs. As it calibrates the trade-off between sensitivity and specificity at the best-chosen threshold, AUC is the preferred statistic in such circumstances. Additionally, accuracy assesses the performance of a single model at a fixed threshold, whereas AUC can both compare models and assess the performance of a single model at various thresholds.
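As an illustration, both metrics can be computed with scikit-learn as sketched below; using the one-vs-rest, macro-averaged AUC for the multi-class case is an assumption, since the paper does not state which multi-class AUC variant it uses.

```python
# Accuracy and AUC for a multi-class classifier, under the assumption above.
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(y_true, y_pred_probs):
    """y_true: integer labels; y_pred_probs: (n_samples, n_classes) scores."""
    acc = accuracy_score(y_true, y_pred_probs.argmax(axis=1))
    auc = roc_auc_score(y_true, y_pred_probs, multi_class="ovr", average="macro")
    return acc, auc
```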
The recognition accuracy and the AUC were used to evaluate how well the trained models perform. In our experiments, the recognition accuracy and AUC of the proposed 3DCNN+ConvLSTM model at 10, 30, and 50 iterations, all at batch size 32, are listed in Table 2. Figures 4 and 5 show the training and validation performance on the UCF-Crime dataset at 10 and 30 epochs respectively, while Fig. 6 shows the training and validation performance on the UCF-101 dataset at 100 epochs; the model clearly performed well on the training dataset, where training accuracy was almost 100%. The trained model was tested using 25% of each dataset, and the best recognition accuracy was 100%, obtained at 50 epochs on the UCF-101 dataset, while the model achieves accuracies of 98.5%, 95.1%, 99%, and 97.1% on the UCF-Crime, XD-Violence, CCTV-Fights, and UBI-Fights datasets respectively. The model achieves its maximum recognition accuracy on all five datasets when trained for 50 epochs, which takes 25 hours for the UCF-Crime dataset. The model achieves AUCs of 92.2%, 87.7%, 94.3%, 93.3%, and 92.3% on the UCF-Crime, XD-Violence, CCTV-Fights, UBI-Fights, and UCF-101 datasets respectively. These results are competitive with the recent research mentioned in the comparisons in Tables 3, 4, 5, and 6.

Table 3 compares our results with those of other models reported for the UCF-Crime dataset; it shows that our model provides the best AUC, 92.2%, at 50 epochs. Table 4 makes the same comparison for the XD-Violence dataset and shows that our model provides the best AUC, 87.7%, at 50 epochs. Table 5 compares results for the UBI-Fights dataset and shows that our model provides the best AUC, 93.3%, at 50 epochs. Table 6 compares results for the UCF-101 dataset and shows that our model provides the best accuracy, 100%, at 50 epochs. Figures 7 and 8 illustrate, as examples, the anomaly detection over time for an abuse video and an explosion video.

Table 2. Comparison of our model's performance on the UCF-Crime, XD-Violence, CCTV, UBI-Fights, and UCF-101 datasets

Measure       UCF-Crime   XD-Violence   CCTV    UBI-Fights   UCF-101
Accuracy_10   89%         81.9%         91.7%   89.7%        90.7%
AUC_10        80%         79%           83%     82.6%        87%
Accuracy_30   93.4%       92.3%         94.1%   93.1%        95.1%
AUC_30        85.6%       83.2%         89%     89.8%        89.3%
Accuracy_50   98.5%       95.1%         99%     97.1%        100%
AUC_50        92.2%       87.7%         94.3%   93.3%        92.3%

Figure 4. the model’s training and validation accuracy for the UCF-crime dataset for 10 epochs


Figure 5. the model’s training and validation accuracy for the UCF-crime dataset for 30 epochs

Figure 6. the model’s training and validation accuracy for the UCF-101 dataset for 100 epochs

Table 3. A comparison between the results of our model and other models for the UCF-Crime dataset
REF.  AUC     Method                                                         Year
[10]  86.98%  Magnitude-Contrastive Glance-and-Focus Network (MGFN)          2022
[9]   85.99%  Self-Supervised Sparse Representation (S3R)                    2022
[7]   85.38%  Weakly Supervised Anomaly Localization (WSAL)                  2020
[17]  84.89%  Learning Causal Temporal Relation and Feature Discrimination   2021
[18]  84.48%  Multi-stream Network with Late Fuzzy Fusion                    2022
[19]  84.03%  Robust Temporal Feature Magnitude learning (RTFM)              2021
Ours  92.2%   3DCNN+ConvLSTM                                                 2023

Table 4. A comparison between the results of our model and other models for the XD-Violence dataset
REF.  AUC     Method                                                                         Year
[9]   80.26%  S3R                                                                            2022
[10]  82.11%  MGFN                                                                           2022
[19]  77.81%  RTFM                                                                           2021
[20]  83.54%  Cross-Modal Awareness Local-Arousal (CMA-LA)                                   2022
[21]  83.4%   Modality-aware contrastive instance learning with self-distillation (MACIL-SD) 2022
Ours  87.7%   3DCNN+ConvLSTM                                                                 2023

Table 5. A comparison between the results of our model and other models for the UBI-Fights dataset
REF.  AUC    Method                               Year
[1]   90.6%  Gaussian Mixture Model-based (GMM)   2020
[5]   89.2%  Sultani et al.                       2018
[22]  61%    Variational AutoEncoder (S2-VAE)     2018
Ours  93.3%  3DCNN+ConvLSTM                       2023

Figure 7. Anomaly detection over time in an abuse video


Table 6. A comparison between the results of our model and other models for the UCF-101 dataset
REF.  Accuracy  Method                                 Year
[23]  98.64%    SMART frame selection                  2020
[24]  98.6%     OmniSource                             2020
[25]  98.2%     Text4Vis                               2022
[26]  98.2%     Local and Global Diffusion (LGD-3D)    2019
Ours  100%      3DCNN+ConvLSTM                         2023

Figure 8. Anomaly detection over time in an explosion video

9. CONCLUSION
Since deep learning is a potent artificial intelligence technique for video categorization, we proposed a deep-learning-based anomaly detection model in this study, in which the 3DCNN and ConvLSTM models work together to address anomaly detection problems. We evaluated the proposed approach on five large-scale datasets. Performance on the five datasets was excellent, and model training accuracy reached 100%. The recognition accuracies were 98.5%, 95.1%, 99%, 97.1%, and 100% on the UCF-Crime, XD-Violence, CCTV-Fights, UBI-Fights, and UCF-101 datasets respectively. Compared to a 3DCNN alone, 3DCNN+ConvLSTM produced a good performance on these datasets, and the results of our study show that the model is more accurate than the other competing models in the comparison. As an extension of the current work, we intend to create a model for predicting anomalies from surveillance video.

REFERENCES
[1] B. M. Degardin, “Weakly and Partially Supervised Learning Frameworks for Anomaly Detection,”
2020, doi:10.13140/RG.2.2.30613.65769.

[2] P. Patel and A. Thakkar, "The upsurge of deep learning for computer vision applications," International Journal of Electrical and Computer Engineering (IJECE), vol. 10, no. 1, pp. 538-548, February 2020, doi: 10.11591/ijece.v10i1.pp538-548.

[3] B. Omarov, S. Narynov, Z. Zhumanov, A. Gumar, and M. Khassanova, "State-of-the-art violence detection techniques in video surveillance security systems: A systematic review," PeerJ Comput. Sci., vol. 8, 2022, doi: 10.7717/PEERJ-CS.920.

[4] A. M. Kamoona, A. K. Gostar, A. Bab-Hadiashar, and R. Hoseinnezhad, "Sparsity-Based Naive Bayes Approach for Anomaly Detection in Real Surveillance Videos," in ICCAIS 2019 - 8th International Conference on Control, Automation and Information Sciences, Oct. 2019, doi: 10.1109/ICCAIS46528.2019.9074564.

[5] W. Sultani, C. Chen, and M. Shah, “Real-world Anomaly Detection in Surveillance Videos,” Jan.
2018, [Online]. Available: http://arxiv.org/abs/1801.04264

[6] L. Ye, T. Liu, T. Han, H. Ferdinando, T. Seppänen, and E. Alasaarela, “Campus violence detection
based on artificial intelligent interpretation of surveillance video sequences,” Remote Sens., vol. 13,
no. 4, pp. 1–17, 2021, doi: 10.3390/rs13040628.

[7] H. Lv, C. Zhou, Z. Cui, C. Xu, Y. Li, and J. Yang, “Localizing Anomalies from Weakly-Labeled
Videos,” IEEE Trans. Image Process., vol. 30, pp. 4505–4515, 2021, doi: 10.1109/TIP.2021.3072863.

[8] S. Ji, W. Xu, M. Yang, and K. Yu, “3D Convolutional neural networks for human action
recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, 2013, doi:
10.1109/TPAMI.2012.59.

[9] J.-C. Wu, H.-Y. Hsieh, D.-J. Chen, C.-S. Fuh, and T.-L. Liu, “Self-supervised Sparse Representa-
tion for Video Anomaly Detection,” pp. 729–745, 2022, doi: 10.1007/978-3-031-19778-9_42.

[10] Y. Chen, Z. Liu, B. Zhang, W. Fok, X. Qi, and Y.-C. Wu, “MGFN: Magnitude-
Contrastive Glance-and-Focus Network for Weakly-Supervised Video Anomaly Detection,” 2022,
doi:10.48550/arXiv.2211.15098.

[11] E. K. Elsayed and D. R. Fathy, “Semantic Deep Learning to Translate Dynamic Sign Language,”
Int. J. Intell. Eng. Syst., vol. 14, no. 1, pp. 316–325, Nov. 2020, doi: 10.22266/IJIES2021.0228.30.

[12] X. Shi, Z. Chen, H. Wang, D. Y. Yeung, W. K. Wong, and W. C. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," Adv. Neural Inf. Process. Syst., pp. 802-810, 2015.

[13] P. Wu et al., “Not only Look, But Also Listen: Learning Multimodal Violence Detection Under
Weak Supervision,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect.
Notes Bioinformatics), vol. 12375 LNCS, pp. 322–339, 2020, doi: 10.1007/978-3-030-58577-8_20.

[14] B. Degardin and H. Proenca, "Human activity analysis: Iterative weak/self-supervised learning frameworks for detecting abnormal events," IJCB 2020 - IEEE/IAPR Int. Jt. Conf. Biometrics, 2020, doi: 10.1109/IJCB48548.2020.9304905.

[15] M. Perez, A. C. Kot, and A. Rocha, "Detection of real-world fights in surveillance videos (CCTV-Fights-ICASSP2019)," ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc., pp. 2662-2666, 2019, doi: 10.1109/ICASSP.2019.8683676.

[16] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human action classes from videos in the wild," CRCV-TR-12-01, November 2012.

[17] P. Wu and J. Liu, “Learning Causal Temporal Relation and Feature Discrimination
for Anomaly Detection,” IEEE Trans. Image Process., vol. 30, pp. 3513–3527, 2021, doi:
10.1109/TIP.2021.3062192.

[18] K. V. Thakare, N. Sharma, D. P. Dogra, H. Choi, and I. J. Kim, “A multi-stream deep neural
network with late fuzzy fusion for real-world anomaly detection,” Expert Syst. Appl., vol. 201, 2022,
doi: 10.1016/j.eswa.2022.117030.


[19] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, “Weakly-supervised Video
Anomaly Detection with Robust Temporal Feature Magnitude Learning,” Proc. IEEE Int. Conf.
Comput. Vis., pp. 4955–4966, 2021, doi: 10.1109/ICCV48922.2021.00493.
[20] Y. Pu and X. Wu, "Audio-Guided Attention Network for Weakly Supervised Violence Detection," in 2022 2nd International Conference on Consumer Electronics and Computer Engineering (ICCECE), 2022, doi: 10.1109/ICCECE54139.2022.9712793.
[21] J. Yu, J. Liu, Y. Cheng, R. Feng, and Y. Zhang, “Modality-aware Contrastive Instance Learn-
ing with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection,” pp. 6278–6287,
2022, doi: 10.1145/3503161.3547868.
[22] T. Wang et al., “Generative Neural Networks for Anomaly Detection in Crowded Scenes,” IEEE
Trans. Inf. Forensics Secur., vol. 14, no. 5, pp. 1390–1399, 2019, doi: 10.1109/TIFS.2018.2878538.
[23] S. N. Gowda, M. Rohrbach, and L. Sevilla-Lara, “SMART Frame Selection for Action
Recognition,” 35th AAAI Conf. Artif. Intell. AAAI 2021, vol. 2B, pp. 1451–1459, 2021, doi:
10.1609/aaai.v35i2.16235.
[24] H. Duan, Y. Zhao, Y. Xiong, W. Liu, and D. Lin, “Omni-Sourced Webly-Supervised Learning
for Video Recognition,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect.
Notes Bioinformatics), vol. 12360 LNCS, pp. 670–688, 2020, doi: 10.1007/978-3-030-58555-6_40.
[25] W. Wu, Z. Sun, and W. Ouyang, “Transferring Textual Knowledge for Visual Recognition,” 2022,
[Online]. Available: http://arxiv.org/abs/2207.01297
[26] Z. Qiu, T. Yao, C. W. Ngo, X. Tian, and T. Mei, “Learning spatio-temporal representation with
local and global diffusion,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol.
2019-June, pp. 12048–12057, 2019, doi: 10.1109/CVPR.2019.01233.

BIOGRAPHIES OF AUTHORS
Esraa A. Mahareek received the B.Sc. degree in computer science in 2012 and the M.Sc. degree in computer science in 2021. She is currently a teaching assistant in computer science at the Mathematics Department, Faculty of Science, Al-Azhar University, Cairo, Egypt. She has published research papers in the fields of AI, machine learning, and metaheuristic optimization. She can be contacted at email: [email protected].

Prof. Eman K. Elsayed is the Dean of the School of Computer Science, Canadian International College. She received her Bachelor of Science from the Computer Science Department, Cairo University, in 1994, her Master of Computer Science from Cairo University in 1999, and her PhD in computer science from Al-Azhar University in 2005, and she has been a professor of computer science since 2019. She has published 65 papers up to March 2023 in different branches of AI ([email protected]).

Prof. Nahed M. El-Desouky is an associate professor of Computer Science and Information Systems at Al-Azhar University (Girls) in Cairo, Egypt. She received her Bachelor of Science from the Faculty of Engineering, Cairo University, in 1982, her Master of Communication and Electronics from the Faculty of Engineering, Cairo University, in 1990, and her PhD in Communication and Electronics from the Faculty of Engineering, Cairo University, in 1999. She has been an associate professor of computer science since 2010 and has published 23 papers in different branches of computer science: security, cloud computing, and optimization ([email protected]).

Prof. Kamal Abdelraouf ElDahshan is a professor of Computer Science and Information Systems at Al-Azhar University in Cairo, Egypt. At Al-Azhar, he founded the Centre of Excellence in Information Technology in collaboration with the Indian government, and was also the founder and former president of the coordination bureau of the Egyptian Knowledge Bank, the country's largest initiative for academic access. An Egyptian national and graduate of Cairo University, he obtained his doctoral degree from the Université de Technologie de Compiègne in France, where he also taught for several years. During his extended stay in France, he also worked at the prestigious Institut National de Télécommunications in Paris.
Professor ElDahshan's extensive international research, teaching, and consulting experience has spanned four continents and includes academic institutions as well as government and private organizations. He taught at Virginia Tech as a visiting professor; he was a consultant to the Egyptian Cabinet Information and Decision Support Centre (IDSC); and he was a senior advisor to the Ministry of Education and Deputy Director of the National Technology Development Centre.
Professor ElDahshan is a professional Fellow on Open Educational Resources (OER) as recognized by the United States Department of State and an Expert with the Arab League Educational, Cultural and Scientific Organization. A tireless advocate for equitable access to knowledge for all, he co-founded, in 2018, the Arab Foundation for the Deaf and Hearing Impaired, which aims at supporting and empowering its beneficiaries to contribute to the public, scholarly, and cultural lives of their communities as equals. Among other accolades, he is a Fellow of the British Computer Society and a Founding Member of the Egyptian Mathematical Society ([email protected]).
