
2024 International Conference on Electrical Electronics and Computing Technologies (ICEECT)

Sign Language Detection Using MediaPipe and Deep Learning

979-8-3503-7809-2/24/$31.00 ©2024 IEEE



Vidhi Bansal, Sandali Sinha, Rani Astya
Department of Computer Science and Engineering, Sharda University, Greater Noida, India
[email protected], [email protected], [email protected]



Abstract— This paper presents an efficient and accurate real-time sign language recognition system that understands American Sign Language (ASL) gestures using MediaPipe and Convolutional Neural Networks (CNNs). The system captures and processes hand gestures in real time, allowing for hands-free command input without physical engagement. MediaPipe, well known for its powerful hand-tracking capabilities, is used to capture and preprocess hand movements, considerably improving gesture recognition accuracy. The processed data is fed into a CNN model trained on ASL datasets, resulting in 99.81% accuracy in recognizing individual signs. The findings emphasize the system's potential applications in communication, education, and accessibility, highlighting its importance for people with hearing impairments. The combination of MediaPipe with CNNs provides a scalable solution for accessible, real-time sign language detection, contributing to progress in inclusive technology.

Keywords—sign language recognition in real-time, CNN, MediaPipe, ASL dataset

I. INTRODUCTION

According to World Health Organisation (WHO) estimates, around 15% of the world's population [1], or approximately 1 billion individuals, live with some form of impairment. Approximately 5% of the world's population, or 430 million people, suffer from disabling hearing loss, which can range from mild to severe. It is anticipated that by 2050, more than 700 million people will have severe hearing loss, classified as a loss of over 40 decibels in adults and over 30 decibels in children.

Individuals with hearing impairments can express their opinions, sentiments, and directives through regulated hand gestures and movements, making sign language a crucial means of communication. However, a significant communication barrier often exists between sign language users and those unfamiliar with the language, posing challenges to inclusivity. Advancements in machine learning and computer vision have introduced promising solutions to this issue by enabling the development of systems capable of automatically recognizing and translating sign language movements [2]. The most widely used sign languages include American Sign Language (ASL) [3], Australian Sign Language (AUSLAN) [4], Arabic Sign Language (ArSL) [5] and Spanish Sign Language (SSL) [6].

Convolutional Neural Networks (CNNs) are a deep learning architecture [7] that has demonstrated significant effectiveness in image recognition and classification tasks. Numerous machine learning algorithms have been created for hand gesture recognition, enabling the development of AI-driven applications. Among these, MediaPipe is a valuable tool for recognizing hand gestures [8]. MediaPipe Hands utilizes an integrated machine learning pipeline composed of multiple models working in unison [9]. By automating this interpretation process, such systems can help bridge communication gaps between deaf and hearing people [10] and enhance accessibility for individuals with hearing impairments.

This research explores the application of MediaPipe and CNN in detecting and recognizing sign language gestures. The study investigates the accuracy, robustness, and efficiency of models developed using machine learning in identifying individual signs. Through an analysis of various CNN architectures, we intend to design a system capable of reliably translating visual data into corresponding sign language interpretations, facilitating smoother interactions between sign language users and non-signers.

The major contributions of this paper are:

 In this paper we have examined existing sign language detection models.
 This paper compares the existing models on various parameters such as accuracy, dataset, and application.
 This paper also proposes a model using the MediaPipe library.
 The paper also discusses future directions for the development of machine learning in sign language detection.

The paper is structured as follows: Section II analyses previous studies on sign language detection and compares various models for recognizing hand gestures. Section III discusses the system we have proposed using the dataset we have created. Section IV describes the methodology we have used in developing the model. Section V presents the results and discusses the accuracy of other existing ML models. Section VI summarizes this paper and discusses future directions for the development of machine learning in sign language detection.

II. LITERATURE SURVEY

In 2024, Verma et al. [11] achieved 99.12% accuracy in real-time ASL detection by integrating MediaPipe and CNNs, enhancing communication and accessibility for hearing-impaired individuals. A. M. et al. [12] developed a real-time ASL detection system using CNNs with TensorFlow and Keras in Python, demonstrating advancements in computer vision and assistive technologies for gesture recognition. Sindhu et al. [13] developed a multimodal dynamic system for Indian Sign Language translation, utilizing Natural Language Processing and a Reversible CNN to enhance interaction for those with hearing loss by translating gestures into text. Thong et al. [14] introduced a computer vision system for real-time conversion of gestural language into text, effectively bridging communication barriers and enhancing accessibility for individuals with hearing impairments. Alam et al. [15] reviewed smartphone applications for sign language detection, emphasizing machine learning approaches from 2012-2023 and advocating for universal sign language and affordable devices to enhance communication for speech-disordered individuals. D. G. et al. [16] suggested a machine learning system for real-time recognition of Indian Sign Language using transfer learning with CNN and SSD MobileNet V2, achieving 85.45% confidence to enhance communication accessibility.

In 2023, Deshpande et al. [17] developed a detection system for sign language recognition using CNN, enhancing human-computer interaction by efficiently recognizing American Sign Language gestures under varying lighting conditions for intuitive communication interfaces. Boondamnoen et al. [18] explored LSTM and CNN architectures for sign language translation, optimizing gesture recognition accuracy and enhancing accessibility for the hearing impaired through improved systems for sign language detection. Puchakayala et al. [19] proposed a deep learning framework for recognizing American Sign Language using CNN and YOLO, enhancing real-time gesture detection to increase communication accessibility for those with hearing problems. Pathan et al. [20] introduced a cost-effective method for American Sign Language recognition using the "Finger Spelling, A" dataset and a multi-headed CNN, achieving 98.98% accuracy through image processing and hand landmark extraction.

In 2022, Pathak et al. [21] presented a sign language recognition system for real-time detection using a CNN model with SSD MobileNet V2, achieving 70-80% accuracy and facilitating communication for the deaf through gesture recognition. Wang et al. [22] proposed an enhanced 3D ResNet algorithm for sign language recognition, integrating EfficientDet and dual-channel spatial attention to achieve 91.12% accuracy on CSL datasets, aiming to improve recognition speed.

In 2021, Ikram et al. [23] proposed a CNN method using webcams and OpenCV for recognizing American Sign Language gestures, enhancing assistive tools like virtual assistants and improving accessibility through advances in gesture recognition. Adeyanju et al. [24] reviewed 649 publications on intelligent sign language recognition systems, highlighting advancements in machine learning while identifying challenges such as cost and accuracy, and suggesting deep learning for future improvements.

In 2020, Zhao et al. [25] introduced a recognition system for real-time sign language detection from video streams, employing 3D CNN and optical flow techniques to enhance gesture recognition and communication for the hearing impaired. P. Das et al. [26] created a recognition system for static American Sign Language hand gestures based on CNN, achieving 94.34% validation accuracy on an ASL dataset of 1,815 images, enhancing communication for the deaf community.

In 2019, S. He et al. [27] explored a sign language translation system using deep learning, integrating Faster R-CNN for hand detection with 3D CNN and LSTM to achieve a 99% recognition rate, enhancing communication for the hearing impaired. YunJung Ku et al. [28] presented the Sign Language Translator smartphone app, utilizing pose detection for hand positions and a classification model, achieving over 80% accuracy in translating American Sign Language gestures into text and enhancing accessibility.

Table 1. Comparison of various models

Paper Title | Model Used | Key Features | Dataset | Accuracy | Key Applications
Gesture-to-Text: A Real-Time Indian Sign Language Translator | PoseNet + LSTM | Real-time, lightweight, pose estimation, sequential data processing | 300 ISL words, 4000 videos | 98% | Real-time ISL translation, virtual meetings, online communication
Efficient Deep Learning Models Based on Attention Techniques | YOLOv5x + Attention Mechanisms (CBAM, SE) | Attention mechanisms, real-time edge platform support | MU HandImages ASL, OkkhorNama: BdSL | 98.9% (ASL), 97.6% (BdSL) | Real-time sign language recognition of alphabetic and numeric gestures
Continuous Word-Level Sign Language Recognition Using an Expert System | YOLOv4 + SVM | Two-model approach (YOLOv4 for detection, SVM for classification), real-time, expert system | Custom dataset, 676 images, 80 ISL signs | 98.8% (YOLOv4), 98.62% (SVM) | Education, communication tools for the hearing impaired
Real-Time Sign Language Detection Using CNN | CNN | Real-time detection, low-cost hardware requirement, scalable system | Sign language gestures dataset (unspecified) | High accuracy | Inclusive communication, everyday communication aid
Sign Language Recognition Using Fusion of Image and Hand Landmarks | Multi-Headed CNN | Fusion of image + hand landmarks, end-to-end learning, multi-headed architecture | Sign language gestures dataset | Higher than traditional models | Comprehensive communication tools integrating multiple data inputs
Improved 3D ResNet Sign Language Recognition Algorithm | 3D ResNet + Enhanced Hand Features | Enhanced hand features, 3D convolutional layers, residual learning | Various sign languages | Outperforms existing methods | Real-time video-based sign language translation
Real-Time Sign Language Detection | MobileNet V2 + Transfer Learning | CNN architecture, transfer learning, real-time processing, custom dataset | Custom dataset | 70-80% | Assistive communication, sign language detection in diverse environments
Deepsign: Sign Language Detection and Recognition Using Deep Learning | LSTM + GRU + InceptionResNetV2 | LSTM-GRU combination for video frame analysis, feature extraction using InceptionResNetV2 | IISL2020, 1100 samples per sign | ~97% | ISL gesture recognition, potential expansion to continuous sign language recognition

III. PROPOSED SYSTEM

This research focuses on developing a technical solution for recognizing sign language. By leveraging machine learning, the study aims to build a system capable of identifying hand gestures in American Sign Language. The dataset for this project includes three-dimensional images of alphabetical sign gestures. The MediaPipe framework is applied to detect landmarks within these images [29].

For a real-time sign language detection system developed specifically for American Sign Language, a dataset of 2600 samples was generated, including the vowels and consonants of American Sign Language. The dataset consists of alphabet signs in American Sign Language, as shown in Fig. 1. For each alphabet, 100 images are captured and stored in designated folders. During data acquisition, images are captured via a webcam using Python and OpenCV, which provides essential functions for real-time computer vision. OpenCV not only facilitates machine perception for commercial applications but also serves as a unified infrastructure for computer vision tasks. It includes over 2,500 optimized algorithms that enable tasks such as face detection, object recognition, camera tracking, 3D modeling, human action classification and more [30]. The results indicate that this method is effective in recognizing alphabets and gestures in American Sign Language, suggesting its potential for identifying signs and gestures in various other languages as well.

Fig 1. Sample Dataset

IV. METHODOLOGY

In this article, we introduce a novel approach for detecting gestural language by dividing the system into four main steps: Data Collection, Data Preprocessing, Model Training, and Real-Time Prediction. We have used Python for creating, training, and utilizing a gesture recognition system using computer vision, machine learning, and hand landmark detection via the MediaPipe library. Each of these components is explained in detail in its respective section.

1. Data Collection
A framework is provided for capturing gesture images using a webcam, allowing users to create datasets for gesture recognition. The program starts by asking the user to input the number of gestures they want to capture and the name of each captured gesture. For each gesture, it creates a corresponding directory inside a main folder, typically named DATA_DIR, which defaults to ./data. If the directories for the gestures don't already exist, they are created automatically. The webcam is accessed through the cv2.VideoCapture function, and the user is instructed to press a specific key (usually 'A') to initiate the capture process. Once activated, the program captures multiple frames of the gesture, saving each image inside the appropriate folder for that gesture. This setup allows for easy and organized collection of gesture data, with the captured samples stored in separate directories for each class of gesture.
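
The following minimal sketch illustrates this data collection step. The folder layout, number of samples per gesture, and capture key follow the description above; other details (window name, image format) are illustrative assumptions rather than the exact implementation.

    # Data collection sketch: capture webcam frames into one folder per gesture.
    import os
    import cv2

    DATA_DIR = './data'            # main data folder, as described above
    SAMPLES_PER_GESTURE = 100      # assumed number of images per class

    num_gestures = int(input('Number of gestures to capture: '))
    for _ in range(num_gestures):
        name = input('Gesture name: ')
        gesture_dir = os.path.join(DATA_DIR, name)
        os.makedirs(gesture_dir, exist_ok=True)   # create the folder if it is missing

        cap = cv2.VideoCapture(0)
        # Wait until the user presses 'A' to start capturing this gesture.
        while True:
            ret, frame = cap.read()
            if not ret:
                continue
            cv2.putText(frame, "Press 'A' to capture", (30, 40),
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
            cv2.imshow('capture', frame)
            if cv2.waitKey(25) & 0xFF == ord('a'):
                break

        # Save the requested number of frames into the gesture's folder.
        for i in range(SAMPLES_PER_GESTURE):
            ret, frame = cap.read()
            if not ret:
                continue
            cv2.imshow('capture', frame)
            cv2.waitKey(25)
            cv2.imwrite(os.path.join(gesture_dir, f'{i}.jpg'), frame)
        cap.release()

    cv2.destroyAllWindows()
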
2. Data Preprocessing
Gesture image data is processed by detecting and extracting hand landmarks using MediaPipe, preparing the data for machine learning. The step begins by using MediaPipe to detect hand landmarks in each of the captured images. Once the landmarks are recognized, the code extracts the x and y coordinates of the landmarks for each detected hand. To standardize the data, it normalizes these coordinates by subtracting the minimum x and y values, creating consistent feature vectors. Each image is then labeled according to its corresponding gesture, with the label represented as the index of the gesture. After processing all the images, the extracted features and labels are stored in a pickle file, typically named data.pickle, for later training of machine learning models. This process ensures that the data is well structured and ready for gesture recognition tasks.
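
A minimal sketch of this preprocessing step is given below. It assumes one detected hand per image and reuses the folder and file names described above; anything else is illustrative.

    # Preprocessing sketch: extract and normalize MediaPipe hand landmarks,
    # then store the feature vectors and labels in data.pickle.
    import os
    import pickle
    import cv2
    import mediapipe as mp

    DATA_DIR = './data'
    hands = mp.solutions.hands.Hands(static_image_mode=True,
                                     min_detection_confidence=0.3)

    data, labels = [], []
    for label_idx, gesture in enumerate(sorted(os.listdir(DATA_DIR))):
        for img_name in os.listdir(os.path.join(DATA_DIR, gesture)):
            img = cv2.imread(os.path.join(DATA_DIR, gesture, img_name))
            results = hands.process(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
            if not results.multi_hand_landmarks:
                continue  # skip images where no hand was detected
            for hand_landmarks in results.multi_hand_landmarks:
                xs = [lm.x for lm in hand_landmarks.landmark]
                ys = [lm.y for lm in hand_landmarks.landmark]
                # Normalize by subtracting the minimum x and y values.
                feature = []
                for lm in hand_landmarks.landmark:
                    feature.extend([lm.x - min(xs), lm.y - min(ys)])
                data.append(feature)
                labels.append(label_idx)   # label = index of the gesture

    with open('data.pickle', 'wb') as f:
        pickle.dump({'data': data, 'labels': labels}, f)
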
3. Model Training
A machine learning model is trained using the preprocessed hand landmark data from a pickle file (data.pickle). The step begins by loading the saved data, which includes the extracted hand landmark features and the corresponding gesture labels. train_test_split is then used to divide the data into training and test sets so that the model can be evaluated on unseen data. A RandomForestClassifier is trained on the training set and learns to identify the gestures from the features. Using the test set, the system's efficacy is assessed, and the accuracy of the results is computed with accuracy_score as the proportion of correctly classified samples. Once the model is trained, both the model and the gesture names are saved into a new pickle file (model.p) for future use in gesture recognition tasks. This process efficiently trains, evaluates, and stores the model for real-world applications.
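
The training step can be sketched as follows, using the scikit-learn functions named above; the split ratio and classifier settings are illustrative assumptions.

    # Training sketch: load data.pickle, split, fit a RandomForest, report
    # accuracy_score, and save the trained model to model.p.
    import pickle
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    with open('data.pickle', 'rb') as f:
        dataset = pickle.load(f)

    X = np.asarray(dataset['data'])
    y = np.asarray(dataset['labels'])

    # Hold out a stratified test set so the model is evaluated on unseen data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, stratify=y)

    model = RandomForestClassifier()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, y_pred))

    # The gesture names can be stored alongside the model in the same file.
    with open('model.p', 'wb') as f:
        pickle.dump({'model': model}, f)
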
4. Real-Time Gesture Prediction
This step enables real-time gesture recognition using the trained machine learning model. It begins by loading the previously trained RandomForest model and the corresponding gesture names from a pickle file (model.p). The code records live video frames from the webcam using cv2.VideoCapture and processes each frame with MediaPipe to identify hand landmarks. For every detected hand, the code extracts the relevant features—the hand landmark coordinates—and passes them to the trained model to predict the gesture. Once a gesture is recognized, the gesture name is converted into speech using the pyttsx3 library, providing audio feedback. Additionally, the predicted gesture is displayed on the video frame, along with a bounding box drawn around the detected hand using OpenCV. This setup allows for intuitive, real-time gesture recognition with both visual and audio outputs.
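
A minimal sketch of the real-time prediction loop is shown below. The gesture names, window name, and exit key are placeholders, and the feature extraction mirrors the preprocessing step above.

    # Real-time prediction sketch: webcam -> MediaPipe landmarks -> RandomForest
    # prediction -> bounding box, label overlay, and speech output.
    import pickle
    import cv2
    import mediapipe as mp
    import numpy as np
    import pyttsx3

    model = pickle.load(open('model.p', 'rb'))['model']
    gesture_names = ['A', 'B', 'C']          # placeholder class names
    engine = pyttsx3.init()
    hands = mp.solutions.hands.Hands(min_detection_confidence=0.3)

    cap = cv2.VideoCapture(0)
    last_spoken = None
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark
            xs, ys = [p.x for p in lm], [p.y for p in lm]
            feature = []
            for p in lm:
                feature.extend([p.x - min(xs), p.y - min(ys)])
            pred = gesture_names[int(model.predict([np.asarray(feature)])[0])]

            # Draw a bounding box around the hand and show the predicted label.
            h, w = frame.shape[:2]
            x1, y1 = int(min(xs) * w) - 10, int(min(ys) * h) - 10
            x2, y2 = int(max(xs) * w) + 10, int(max(ys) * h) + 10
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 0), 3)
            cv2.putText(frame, pred, (x1, y1 - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 0, 0), 3)

            if pred != last_spoken:          # speak only when the label changes
                engine.say(pred)
                engine.runAndWait()
                last_spoken = pred

        cv2.imshow('prediction', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()
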

Fig. 2 Methodology

Key Libraries and Concepts:

This system integrates several powerful libraries to enable gesture classification and prediction through machine learning, combining real-time video processing with predictive modeling. OpenCV (cv2) is used for capturing video from the webcam and displaying the processed frames, while MediaPipe is employed for detecting hand landmarks in each frame. Scikit-learn is leveraged for tasks such as splitting the data, training the RandomForestClassifier, and making predictions based on hand landmark features. The pyttsx3 library is used to convert predicted gesture names into speech, providing auditory feedback. Additionally, Pickle is utilized to save and load both the data and the trained model for future use. Together, these components enable a robust, real-time gesture recognition system that processes video input, predicts gestures using machine learning, and offers both visual and audio outputs.

V. RESULT AND DISCUSSION

For sign language detection, a variety of machine learning models have been used, and their performance has been assessed using measures such as accuracy on the corresponding dataset. With an accuracy of 99.81%, our model outperformed the other models tested, including CNN (Sign Language MNIST), Multi-headed CNN, Faster R-CNN and SVM, as shown in Table 2. MediaPipe's pre-trained, high-quality hand landmark detection pipeline allows for precise identification and tracking of hand movements, which is critical for gesture and sign language recognition and accounts for its success. Additionally, MediaPipe is optimized for performance, making it both computationally efficient and adaptable to various platforms, further supporting robust, real-time gesture recognition in diverse settings. Each tested model's accuracy is shown in the table below.

Table 2 Accuracy of various ML algorithms

Model | Dataset | Accuracy
MediaPipe Model | ASL Dataset | 99.81%
CNN (Sign Language MNIST) | Sign Language MNIST | 95%
Multi-headed CNN | Fingerspelling A dataset | 98.98%
Faster R-CNN | Custom dataset | 82.8%
SVM | Custom dataset | 97%

Confusion Matrix: A confusion matrix is a crucial instrument for evaluating classification models, showing how well predicted labels match actual labels across the various classes. For binary classification, it shows counts of true positives (correct positive predictions), false positives (negatives mistakenly predicted as positive), true negatives (correct negatives), and false negatives (missed positives).
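
For instance, assuming the test split and predictions from the model training step above (y_test and y_pred), the confusion matrix can be computed and plotted with scikit-learn as in the following sketch; the use of matplotlib here is an assumption.

    # Sketch: compute and plot the confusion matrix for the held-out test set.
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
    import matplotlib.pyplot as plt

    cm = confusion_matrix(y_test, y_pred)
    ConfusionMatrixDisplay(cm).plot()
    plt.show()
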
Fig. 3 Confusion Matrix for MediaPipe Model

Accuracy: The function accuracy_score divides the number of correct predictions by the total number of samples in the dataset, giving a measure of how well the model performed. It calculates accuracy using this formula:

Accuracy = Number of Correct Predictions / Total Number of Predictions

Fig. 4 Training vs. Validation Accuracy Curve

VI. CONCLUSION AND FUTURE SCOPE

In conclusion, this study shows the efficiency of using MediaPipe and Convolutional Neural Networks (CNN) for robust real-time sign language detection, achieving a notable accuracy of 99.81% on American Sign Language (ASL) gestures. MediaPipe is used to identify hand landmarks, which are then processed to extract essential features for the CNN, enabling accurate classification by capturing spatial relationships and gesture patterns. The model's performance highlights its potential as a practical tool for communication, education, and human-computer interaction for the hearing impaired. This approach shows significant promise as a communication aid and educational tool, as well as an innovative human-computer interaction system.

In the future, the classification task for sign language detection can be enhanced through extensive image preprocessing, including contrast adjustment, background subtraction, and cropping. A more advanced approach could involve using a separate convolutional neural network (CNN) to localize and crop the hand for more precise gesture recognition. Additionally, increasing the number of gestures in the dataset will extend the system's ability to recognize a broader range of signs. Furthermore, enhancing the model to recognize facial expressions and other contextual cues could make it more robust. Currently, the model processes isolated signs, but future work could focus on interpreting continuous sign language, enabling the handling of syntax and grammar, specifically in relation to American Sign Language (ASL). This would pave the way for a more comprehensive and natural system capable of understanding fluent sign language communication.
