HAR Documentation
The collected frames, each containing the activity region of a person, are then
input into the Long-term Recurrent Convolutional Network (LRCN) model for
activity prediction. By leveraging the temporal dependencies captured by the
LRCN, we aim to accurately recognize the activities performed by each
individual throughout the video sequence.
Our study focuses on training and evaluating deep learning models, including
ConvLSTM and LRCN, using the UCF50 - Action Recognition Dataset. This
dataset comprises diverse action categories, providing a robust testing ground
for our proposed approach.
The objective of this research is to develop a system capable of accurately
recognizing activities performed by multiple individuals in real-world video
streams. By integrating state-of-the-art deep learning techniques with object
detection and tracking mechanisms, we aim to advance the capabilities of
human activity recognition systems and contribute to applications such as
surveillance, sports analysis, and human-computer interaction.
2. Literature Survey
This paper [1] uses deep learning for human action recognition, combining a
CNN with an LSTM. The CNN extracts useful patterns from each frame of a video,
and the LSTM stores these patterns for overall video analysis. The model was
tested on a large dataset and was found to run faster than comparable systems
while achieving an accuracy of about 80%.
This paper [2] uses depth sensors for human action recognition. A special
feature set is created from the skeleton shapes in the images captured by the
sensors, and a multi-class Support Vector Machine then predicts the action.
The main aim of this paper [3] is to alert people whenever an unusual activity
is detected. This is done by combining YOLO (You Only Look Once) with deep
learning models: YOLO is used for object detection, and the deep learning
models classify the action performed by the recognized object. These models
are able to recognize actions ranging from simple to complex.
This paper [4] proposes a new approach for human action recognition: combining
several deep learning models into a hybrid model with high accuracy. The
proposed model was tested on datasets such as UCF Sports, UCF101, and KTH. The
results show that the proposed model achieves an average accuracy of 96.3% on
the KTH dataset.
This paper [5] proposes a new approach for human action recognition in which
the recognition system takes data from the accelerometers and gyroscopes in
smartphones and applies data mining and machine learning techniques. The
authors first used the Random Forest algorithm and, because of its heavy
computations, later modified it into a Modified Random Forest algorithm that
builds small decision trees for classification.
This paper [6] presents a method called “Action Fusion”, which helps a computer
understand what humans are doing by combining an image that shows how someone
moves with information about specific body parts. Three different training
strategies are used, and their outputs are fused to obtain the best result.
The model was tested on three datasets and was found to outperform previous
approaches.
In this paper [7], the authors studied ten recent techniques that use the
Kinect camera to recognize actions and tested them on six different datasets.
They also improved some of these techniques and tested the improved versions.
They found that most methods are better at recognizing actions performed by
different people than at recognizing the same action viewed from different
angles. They also found that techniques focusing on the person's skeleton are
better for recognizing actions from different viewpoints than techniques
focusing on the depth image.
We will visualize the data along with labels to get an idea of what we will be
dealing with. We will be using the UCF50 - Action Recognition Dataset, which
consists of realistic videos taken from YouTube; this differentiates it from
most other available action recognition datasets, which are not realistic and
are staged by actors. The dataset contains:
50 Action Categories
25 Groups of Videos per Action Category
133 Average Videos per Action Category
199 Average Number of Frames per Video
320 Average Frame Width (pixels) per Video
240 Average Frame Height (pixels) per Video
26 Average Frames Per Second per Video
For visualization, we will pick 20 random categories from the dataset and a
random video from each selected category and will visualize the first frame of
the selected videos with their associated labels written. This way we’ll be able
to visualize a subset (20 random videos) of the dataset.
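The snippet below is a minimal sketch of this visualization step, assuming the
dataset has been extracted to a local UCF50 directory and that OpenCV and
Matplotlib are available; the path and figure layout are illustrative choices
rather than part of the original pipeline.

# Illustrative sketch: show the first frame of one random video from each of
# 20 randomly chosen UCF50 categories, labelled with the category name.
import os
import random

import cv2
import matplotlib.pyplot as plt

DATASET_DIR = "UCF50"  # assumed local path to the extracted dataset

all_categories = os.listdir(DATASET_DIR)
chosen_categories = random.sample(all_categories, 20)

plt.figure(figsize=(20, 12))
for idx, category in enumerate(chosen_categories, start=1):
    category_dir = os.path.join(DATASET_DIR, category)
    video_name = random.choice(os.listdir(category_dir))

    # Read only the first frame of the selected video.
    capture = cv2.VideoCapture(os.path.join(category_dir, video_name))
    ok, frame = capture.read()
    capture.release()
    if not ok:
        continue

    # OpenCV returns BGR; convert to RGB for Matplotlib.
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    plt.subplot(4, 5, idx)
    plt.imshow(frame)
    plt.title(category)
    plt.axis("off")
plt.show()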
3.3 Preprocess the dataset for Analysis
In the first step, we create the extract_frames() function to extract frames
from videos. Each frame is resized to a standard 64 x 64 pixels, a common
preprocessing practice, and normalized by dividing its pixel values by 255.
Unwanted frames that do not contain activity information are removed, so only
useful frames are returned. We set the sequence length to 20 frames, which
serves as the default throughout the project.
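A minimal sketch of what extract_frames() might look like is given below,
assuming OpenCV is used to read the videos; the exact signature and constant
names are illustrative.

# Sketch of the extract_frames() step: read a video, resize every frame to
# 64 x 64, and normalize pixel values to the [0, 1] range.
import cv2
import numpy as np

IMAGE_HEIGHT, IMAGE_WIDTH = 64, 64
SEQUENCE_LENGTH = 20  # default sequence length used throughout the project


def extract_frames(video_path):
    """Return a list of resized, normalized frames from one video."""
    frames = []
    capture = cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:
            break  # end of video
        frame = cv2.resize(frame, (IMAGE_WIDTH, IMAGE_HEIGHT))
        frames.append(frame.astype(np.float32) / 255.0)  # normalize
    capture.release()
    return frames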
In the second step, we define the dataset_creation() function, which maps
features, labels, and video paths for each selected video category. To sample
frames evenly rather than using every frame, we calculate the skip factor as

skip_frames_window = max(int(video_frames_count / SEQUENCE_LENGTH), 1)

With roughly 199 frames per video and a sequence length of 20, this keeps
approximately every 10th frame, ensuring a representative subset of frames for
analysis.
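The sketch below illustrates how dataset_creation() might apply this skip
factor, assuming the usual UCF50 directory layout (UCF50/&lt;class&gt;/&lt;video&gt;) and
reusing the extract_frames() sketch from the previous step; the example class
list is an assumption.

# Sketch of dataset_creation(): collect evenly spaced SEQUENCE_LENGTH-frame
# sequences per video, together with class labels and video paths.
import os

import numpy as np

DATASET_DIR = "UCF50"
SEQUENCE_LENGTH = 20
CLASSES_LIST = ["WalkingWithDog", "TaiChi", "Swing", "HorseRace"]  # example subset


def dataset_creation():
    features, labels, video_paths = [], [], []
    for class_index, class_name in enumerate(CLASSES_LIST):
        class_dir = os.path.join(DATASET_DIR, class_name)
        for video_name in os.listdir(class_dir):
            video_path = os.path.join(class_dir, video_name)
            frames = extract_frames(video_path)  # sketch from the previous step

            # Skip factor: with ~199 frames and SEQUENCE_LENGTH = 20 this is
            # roughly 10, i.e. about every 10th frame is kept.
            skip_frames_window = max(int(len(frames) / SEQUENCE_LENGTH), 1)
            sampled = frames[::skip_frames_window][:SEQUENCE_LENGTH]

            if len(sampled) == SEQUENCE_LENGTH:  # keep only complete sequences
                features.append(sampled)
                labels.append(class_index)
                video_paths.append(video_path)
    return np.asarray(features), np.asarray(labels), video_paths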
Finally, we convert the encoded class indexes to one-hot vectors using Keras
(for instance, via its to_categorical utility). This representation of the
categorical labels is better suited to deep learning models.
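A brief example of this conversion, assuming the labels array produced by
dataset_creation() above and TensorFlow's bundled Keras:

# One-hot encode the integer class indexes with Keras's to_categorical utility.
from tensorflow.keras.utils import to_categorical

features, labels, video_paths = dataset_creation()
one_hot_encoded_labels = to_categorical(labels)  # shape: (num_videos, num_classes)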
3.4 Divide the data into test and train sets
Before proceeding with the training and testing of our proposed models, it is
crucial to partition our structured dataset into separate training and testing sets.
The inputs to this split are the features and the one_hot_encoded_labels
produced from the categorical data. To achieve this, we utilize the sklearn
library in Python to split the dataset, allocating 75% of the data for
training and reserving 25% for testing. To avoid any kind of bias, we shuffle
the dataset by setting shuffle=True, and we also set random_state to 27 for
reproducibility. This
partitioned dataset will serve as the basis for training and testing our
ConvLSTM and LRCN models.
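A sketch of this split with scikit-learn, assuming the features and
one_hot_encoded_labels arrays from the preprocessing steps above:

# Split features and one-hot labels 75/25, shuffling with a fixed seed of 27.
from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(
    features,
    one_hot_encoded_labels,
    test_size=0.25,
    shuffle=True,
    random_state=27,
)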
4.1: Implement ConvLSTM
In this step, we will implement the first approach: the ConvLSTM model. This
model blends Convolutional Neural Networks (CNNs) with Long Short-Term
Memory (LSTM) cells. A ConvLSTM cell is essentially an LSTM network but
with convolutions integrated into its structure. This unique combination allows
the model to recognize spatial features in the data while also considering the
temporal relationships between them.
For constructing our model, we'll utilize the Keras library's ConvLSTM2D
recurrent layers. These layers are crucial as they integrate convolutional
operations into the LSTM architecture. We'll specify the number of filters and
kernel size required for these convolutions. After processing through
ConvLSTM2D layers, the output is flattened and passed to a Dense layer with
softmax activation. This layer computes the probability of each action category.
We will also use MaxPooling3D layers to reduce the dimensions of the frames
and avoid unnecessary computation, and Dropout layers to prevent the model
from overfitting the data. The architecture is deliberately simple and has a
small number of trainable parameters, because we are only dealing with a small
subset of the dataset, which does not require a large-scale model.
The ConvLSTM model we've built consists of a total of 44,524 parameters, all
of which are trainable. There are no non-trainable parameters in this model.
Below is a detailed breakdown of the layers in the ConvLSTM model:
1. ConvLSTM2D Layer: This layer integrates convolutional operations into the
LSTM architecture, enabling the model to learn spatial features while
considering temporal relationships.
2. MaxPooling3D Layer: This layer reduces the dimensions of the frames,
helping to streamline computations and prevent overfitting.
3. Dropout Layer: This layer aids in preventing overfitting by randomly
dropping a fraction of input units during training.
4. Flatten Layer: This layer flattens the output of the ConvLSTM2D layer into
a one-dimensional array.
5. Dense Layer: This layer computes the probability of each action category
using softmax activation.
Overall, the ConvLSTM model is designed to effectively process video data,
capturing both spatial and temporal information to make accurate predictions.
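The following Keras sketch illustrates this layer stack; the filter counts,
kernel sizes, pooling sizes, and dropout rates are assumptions chosen for
illustration, so its parameter count will not match the 44,524 reported above.

# Illustrative Keras sketch of the ConvLSTM architecture described above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    ConvLSTM2D, MaxPooling3D, Dropout, Flatten, Dense
)

SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH = 20, 64, 64
NUM_CLASSES = 4  # number of selected action categories (example)

model = Sequential([
    # ConvLSTM2D: convolutions inside the LSTM cell, keeping the time axis.
    ConvLSTM2D(filters=4, kernel_size=(3, 3), activation="tanh",
               return_sequences=True,
               input_shape=(SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH, 3)),
    MaxPooling3D(pool_size=(1, 2, 2), padding="same"),  # shrink spatial dims
    Dropout(0.2),                                       # reduce overfitting
    ConvLSTM2D(filters=8, kernel_size=(3, 3), activation="tanh",
               return_sequences=True),
    MaxPooling3D(pool_size=(1, 2, 2), padding="same"),
    Dropout(0.2),
    Flatten(),                                          # flatten to one vector
    Dense(NUM_CLASSES, activation="softmax"),           # class probabilities
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model.summary()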
From the metrics, it is observed that the LRCN model performed better than
ConvLSTM, achieving 95% accuracy with a loss of about 0.20, while ConvLSTM
achieved 80% accuracy.
6. Integrate Multi-Person Recognition Technique:
From the results, it appears that the LRCN model has performed very well on a
small number of classes. So, in this step, we will create a method that
combines the YOLO model, the LRCN prediction model, and multi-person activity
recognition logic. Initially, the YOLO model is loaded to identify persons in
each frame of the video. Once persons are detected, their regions are
processed for prediction using the LRCN model.
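A high-level sketch of this integration is shown below; detect_persons() is a
hypothetical placeholder for the YOLO person detector (including whatever
tracking assigns stable person IDs), and lrcn_model is assumed to be the LRCN
model trained earlier.

# Sketch of the multi-person recognition loop: crop each detected person,
# buffer a per-person sequence of frames, and classify it with the LRCN model.
from collections import defaultdict, deque

import cv2
import numpy as np

SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH = 20, 64, 64


def detect_persons(frame):
    """Hypothetical YOLO wrapper: return [(person_id, (x1, y1, x2, y2)), ...]."""
    raise NotImplementedError


def recognize_multi_person(video_path, lrcn_model, class_names):
    buffers = defaultdict(lambda: deque(maxlen=SEQUENCE_LENGTH))
    capture = cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        for person_id, (x1, y1, x2, y2) in detect_persons(frame):
            # Crop the person's activity region, then resize and normalize it.
            crop = cv2.resize(frame[y1:y2, x1:x2], (IMAGE_WIDTH, IMAGE_HEIGHT))
            buffers[person_id].append(crop.astype(np.float32) / 255.0)

            # Once a full sequence is buffered, run the LRCN prediction.
            if len(buffers[person_id]) == SEQUENCE_LENGTH:
                sequence = np.expand_dims(np.asarray(buffers[person_id]), axis=0)
                probabilities = lrcn_model.predict(sequence)[0]
                print(person_id, class_names[int(np.argmax(probabilities))])
    capture.release()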
3. Padmaja, Budi, Madhu Bala Myneni, and Epili Krishna Rao Patro. "A comparison
on visual prediction models for MAMO (multi activity-multi object) recognition using
deep learning." Journal of Big Data 7.1 (2020): 24.
4. Jaouedi, Neziha, Noureddine Boujnah, and Med Salim Bouhlel. "A new hybrid
deep learning model for human action recognition." Journal of King Saud University-
Computer and Information Sciences 32.4 (2020): 447-453.