INDIRA GANDHI ENGINEERING COLLEGE
SAGAR (M.P.)
A
PROJECT REPORT ON
SIGN LANGUAGE RECOGNIZER
Submitted to
Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal (M.P.)
In partial fulfilment of the degree
Of
Bachelor of Technology
In
Information Technology
Guided By:
Dr. Rakhi
Mrs. Poonam Vinode
Department of Information Technology
I.G.E.C. Sagar (M.P.)

Submitted To:
Prof. R.S.S. Rawat (Head of Department)
Department of Information Technology
I.G.E.C. Sagar (M.P.)
Submitted By:
Akshat Jain : 0601IT203D01
Aastha Dubey : 0601IT191002
Himanshu Bawankar : 0601IT191024
INDIRA GANDHI ENGINEERING COLLEGE
SAGAR (M.P.)
CERTIFICATE
This is to certify that Akshat Jain, Aastha Dubey and Himanshu Bawankar of
B.Tech. 7th Semester, Information Technology Engineering, have completed the project report on
Sign Language Recognizer towards the partial fulfilment of the requirements for the award
of the degree of Bachelor of Technology in Information Technology of Rajiv Gandhi
Proudyogiki Vishwavidyalaya, Bhopal (M.P.), for the session 2022-2023.
The work presented in this report has been carried out by them under our guidance and
supervision.

Guided by:
Dr. Rakhi
ACKNOWLEDGEMENT
It is with great reverence that we express our gratitude to our guides, Dr. Rakhi and
Mrs. Poonam Vinode, Department of Information Technology Engineering, Indira
Gandhi Engineering College, for their precious guidance and help in this project work.
The credit for the successful completion of this project goes to their keen interest, timely
guidance and valuable suggestions, without which our endeavour would have been futile.
We owe our regards to Prof. R.S.S. Rawat, Head of the Department of Information
Technology Engineering, for the persistent encouragement and blessings that were
bestowed upon us.
We owe our sincere thanks to the honourable Principal, Dr. Anurag Trivedi, for the
kind support he rendered towards the success of our project.
INDIRA GANDHI ENGINEERING COLLEGE
SAGAR (M.P.)
DECLARATION
ABSTRACT
The goal of this project was to build a neural network able to classify which letter of
the American Sign Language (ASL) alphabet is being signed, given an image of a
signing hand. This project is a first step towards building a possible sign language
translator, which can take communications in sign language and translate them into
written and oral language. Such a translator would greatly lower the barrier for many
deaf and mute individuals to be able to better communicate with others in day to day
interactions. This goal is further motivated by the isolation that is felt within the deaf
community. Loneliness and depression exist at higher rates among the deaf
population, especially when they are immersed in a hearing world [1]. Large barriers
that profoundly affect life quality stem from the communication disconnect between
the deaf and the hearing. Some examples are information deprivation, limitation of
social connections, and difficulty integrating in society [2]. Most research
implementations for this task have used depth maps generated by depth cameras and
high-resolution images. The objective of this project was to see if neural networks
are able to classify signed ASL letters using simple images of hands taken with a
personal device such as a laptop webcam. This is in alignment with the motivation as
this would make a future implementation of a real time ASL-to-oral/written
language translator practical in everyday situations. Sign language is used
widely by people who are hearing impaired as a medium for communication. A sign
language is composed of various gestures formed by different shapes of the
hand, its movements and orientations, as well as facial expressions. Deaf people have
very little or no hearing ability; they use sign language for communication. People
use different sign languages in different parts of the world.
There are around 466 million people worldwide with hearing loss, and 34 million of
these are children. Many new techniques have been developed recently in this field.
In this project, we develop a system for conversion of American Sign Language (ASL)
gestures into text using OpenCV, and for speaking the recognized text aloud.
TABLE OF CONTENTS
1. Introduction
2. Objective
INTRODUCTION
Communication is very crucial to human beings, as it enables us to express
ourselves. We communicate through speech, gestures, body language, reading,
writing or through visual aids, speech being one of the most commonly used among
them. However, unfortunately, for the speaking and hearing impaired minority,
there is a communication gap. Visual aids, or an interpreter, are used for
communicating with them. However, these methods are rather cumbersome and
expensive, and can't be used in an emergency. Sign Language chiefly uses manual
communication to convey meaning. This involves simultaneously combining hand
shapes, orientations and movement of the hands, arms or body to express the
speaker's thoughts.
Sign Language consists of fingerspelling, which spells out words character by
character, and word level association which involves hand gestures that convey the
word meaning. Fingerspelling is a vital tool in sign language, as it enables the
communication of names, addresses and other words that do not carry a meaning in
word level association. In spite of this, fingerspelling is not widely used as it is
challenging to understand and difficult to use. Moreover, there is no universal sign
language and very few people know it, which makes it an inadequate alternative for
communication.
A system for sign language recognition that classifies fingerspelling can solve this
problem. Various machine learning algorithms are used, and their accuracies are
recorded and compared in this report.
OBJECTIVE
Sign Language Recognizer (SLR) is a tool for recognizing the sign language used by deaf
and mute people around the world.
The project aims to develop interactive application software for the translation of signs. This
is a capable tool that easily converts sign language symbols into English text and speech. The
tool will be very useful for both teaching and learning American Sign Language (ASL).
The translator should be able to translate the 26 alphabet gestures of
American Sign Language (ASL).
SOFTWARE REQUIREMENT SPECIFICATION
PURPOSE
Deaf people have very little or no hearing ability; they use sign language for
communication. People use different sign languages in different parts of the world.
The purpose of this system is to provide a unified way of translating such signs into
English text and speech.
Deaf Culture
Sign language users make up cultural minorities, united by common languages and
life experience. Many people view deafness not as a disability, but as a cultural
identity with many advantages. When capitalized, “Deaf” refers to this cultural
identity, while lowercase “deaf” refers to audiological
status. Like other cultures, Deaf cultures are characterized by unique sets of norms
for interacting and living. Sign languages are a central component of Deaf cultures,
their role in Deaf communities even characterized as sacred. Consequently,
development of sign language processing systems is highly sensitive for these
communities. Historically, oral education was promoted for educating deaf children,
not sign language. Subsequently, oralism was widely
enforced, resulting in training students to lip-read and speak, with varying success.
Since then, Deaf communities have fought to use sign languages in schools, work,
and public life. Linguistic work has helped
gain respect for sign languages, by establishing them as natural languages [107].
Legislation has also helped establish legal support for sign language education and
use (e.g., [5]). This historical struggle can make development of sign language
software particularly sensitive in the Deaf community.
Sign Language Linguistics
Just like spoken languages, sign languages are composed of building blocks, or
phonological features, put together under certain rules. The seminal linguistic
analysis of a sign language (ASL) revealed that each sign has three main
phonological features: handshape, location on the body, and movement. More
recent analyses of sign languages offer more sophisticated and detailed
phonological analyses. While phonological features are not always meaningful
(e.g., the bent index finger in the sign APPLE does not mean anything on its own),
they can be [18]. For example, in some cases the movement of the sign has a
grammatical function. In particular, the direction of movement in verbs can indicate
the subject and object of the sentence. Classifiers represent classes of nouns and
verbs – e.g., one hand shape in ASL is used for vehicles, another for flat objects,
and others for grabbing objects of particular shapes. The vehicle hand shape could
be combined with a swerving upward movement to mean a vehicle swerving uphill,
or a jittery straight movement for driving over gravel. Replacing the hand shape
could indicate a person walking instead. These hand shapes, movements, and
locations are not reserved exclusively for classifiers, and can appear in other signs.
Recognition software must differentiate between such usages. Fingerspelling, where
a spoken/written word is spelled out using hand shapes representing letters, is
prevalent in many sign languages. For example, fingerspelling is often used for the
names of people or organizations taken from spoken language. Its execution is
subject to a high degree of articulation, where hand shapes change depending on the
neighboring letters [67]. Recognition software must be able to identify when a hand
shape is used for fingerspelling vs. other functionalities. Sign languages are not
entirely expressed with the hands; movement of the eyebrows, mouth, head,
shoulders, and eye gaze can all be critical.
Signs can also be modified by adding mouth movements – e.g., executing the sign CUP with different mouth
positions can indicate cup size. Sign languages also make extensive use of
depiction: using the body to depict an action (e.g., showing how one would fillet a
fish), dialogue, or psychological events. Subtle shifts in body positioning and eye
gaze can be used to indicate a referent. Sign language recognition software must
accurately detect these non-manual components. There is great diversity in sign
language execution, based on ethnicity, geographic region, age, gender, education,
language proficiency, hearing status, etc.
SCOPE
REQUIREMENT
Hardware Requirements –
• Windows OS (7, 8, 10 or 11) or Linux OS
• RAM: minimum 4 GB
• High-quality camera
• Speakers
Software –
• Python Launcher
DEPENDENCIES
Tools
• Python,
• OpenCV (4.5.4.60)
• cvzone, NumPy, PyWin32
• TensorFlow (2.9.1), Tkinter
Environment
• PyCharm Community Edition 2022.2.1
Experiment Platform
• Windows 10 & 11 (High Quality Camera Required)
Limitation of the Agile Model
For complex projects, the resource requirement and effort are difficult to
estimate.
Dataset Description
The ASL Fingerspelling Dataset from the University of Surrey’s Center for Vision,
Speech and Signal Processing is divided into two types: color images (A) and depth
images (B). To make our Translator accessible through a simple web-app and
laptop with a camera, we chose to only use the color images. They are close-ups of
hands that span the majority of the image surface (Fig. 1). Dataset A comprises
the 24 static signs of ASL captured from 5 users in different sessions with similar
lighting and backgrounds. The images are in color. Their height-to-width ratios vary
significantly but average approximately 150x150 pixels. The dataset contains over
65,000 images. The Massey University Gesture Dataset 2012 contains 2,524 close-up,
color images that are cropped such that the hands touch all four edges of the
frame. The hands have all been tightly cropped with little to no negative space and
placed over a uniform black background. Coincidentally, this dataset was also
captured from five users. The frames average approximately 500x500 pixels. Since
there was little to no variation between the images for the same class of each signer,
we separated the datasets into training and validation by volunteer. Four of the five
volunteers from each dataset were used to train, and the remaining volunteer from
each was used to validate. We opted not to separate a test set since that would
require us to remove one of four hands from the training set and thus significantly
affect generalizability. Instead, we tested the classifier on the web application by
signing ourselves and observing the resulting classification probabilities outputted
by the models.
Pre-processing and Data Augmentation
Both datasets contain images with unequal
heights and widths. Hence, we resize them to 256x256 and take random crops of
224x224 to match the expected input of GoogLeNet. We also zero-center the
data by subtracting the mean image from ILSVRC 2012. Since the possible values
in the image tensors only span 0-255, we do not normalize them. Furthermore, we
make horizontal flips of the images since signs can be performed with either the left
or the right hand, and our datasets have examples of both cases. Since the difference
between any two classes in our datasets is subtle compared to ILSVRC classes, we
attempted padding the images with black pixels such that they preserved their
aspect ratio upon resizing. This padding also allows us to remove fewer relevant
pixels upon taking random crops.
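As an illustration, the resize, random-crop, zero-centering and horizontal-flip steps described above could be sketched roughly as follows with OpenCV and NumPy (only a sketch; the function name is hypothetical and the mean image is assumed to be a precomputed 224x224x3 array from ILSVRC 2012):

import cv2
import numpy as np

def preprocess(img, mean_image, resize=256, crop=224):
    # resize to 256x256, then take a random 224x224 crop (GoogLeNet input size)
    img = cv2.resize(img, (resize, resize))
    y = np.random.randint(0, resize - crop)
    x = np.random.randint(0, resize - crop)
    img = img[y:y + crop, x:x + crop]
    # zero-center with the (assumed precomputed) ILSVRC 2012 mean image
    img = img.astype(np.float32) - mean_image
    # random horizontal flip, since signs may be performed with either hand
    if np.random.rand() < 0.5:
        img = cv2.flip(img, 1)
    return img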
SYSTEM DESIGN
As shown in the figure, the project is structured into three distinct functional
blocks: Data Processing, Training, and Classify Gesture.
Data Processing: The load_data.py script contains functions to load the raw image
data and save it as NumPy arrays into file storage. The process_data.py
script loads the image data from data.npy and preprocesses each image by
resizing/rescaling it and applying filters and ZCA whitening to enhance
features. During training, the processed image data is split into training,
validation, and testing sets and written to storage. Training also involves a
load_dataset.py script that loads the relevant data split into a Dataset class. For use of the
trained model in classifying gestures, an individual image is loaded and processed
from the filesystem.
Training: The training loop for the model is contained in train_model.py. The
model is trained with hyperparameters obtained from a config file that lists the
learning rate, batch size, image filtering, and number of epochs. The configuration
used to train the model is saved along with the model architecture for future
evaluation and tweaking for improved results. Within the training loop, the training
and validation datasets are loaded as data loaders and the model is trained using
the Adam optimizer with cross-entropy loss. The model is evaluated every epoch on
the validation set, and the model with the best validation accuracy is saved to storage for
further evaluation and use. Upon finishing training, the training and validation error
and loss are saved to disk, along with a plot of error and loss over training.
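The description above (a Dataset class, data loaders, Adam with cross-entropy loss, per-epoch validation and checkpointing of the best model) maps directly onto a standard PyTorch-style loop; a hedged sketch of how train_model.py might be organised is shown below. The hyperparameter values and file name are assumptions, not values from the actual config file, and the same structure could equally be written with TensorFlow/Keras, which the dependency list names.

import torch
from torch import nn, optim
from torch.utils.data import DataLoader

def train(model, train_ds, val_ds, epochs=25, lr=1e-3, batch_size=32):
    # wrap the processed data splits in data loaders, as described above
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_ds, batch_size=batch_size)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    best_acc = 0.0
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        # evaluate on the validation set every epoch
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                preds = model(images).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        acc = correct / total
        if acc > best_acc:
            best_acc = acc
            # keep the checkpoint with the best validation accuracy
            torch.save(model.state_dict(), "best_model.pt")
    return best_acc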
Classify Gesture: After a model has been trained, it can be used to classify a new
ASL gesture that is available as a file on the file system. The user inputs the file
path of the gesture image, and the test_data.py script passes the file path to
process_data.py, which loads and preprocesses the file in the same way as the training data.
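A corresponding sketch of the single-image classification path might look like this (the class ordering and the preprocess helper are assumptions carried over from the sketches above):

import cv2
import torch

# assumed class order: letters A-Z followed by space, delete and nothing (29 classes)
CLASSES = [chr(c) for c in range(ord('A'), ord('Z') + 1)] + ["space", "del", "nothing"]

def classify_gesture(model, image_path, preprocess):
    # load one image from disk and preprocess it the same way as the training data
    img = cv2.imread(image_path)
    x = torch.from_numpy(preprocess(img)).permute(2, 0, 1).unsqueeze(0).float()
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]
    return CLASSES[int(probs.argmax())], float(probs.max())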
Sources of Data
Data Collection
The primary source of data for this project was the compiled dataset of American
Sign Language (ASL) called the ASL Alphabet from Kaggle user Akash [3]. The
dataset is comprised of 87,000 images which are 200x200 pixels. There are 29 total
classes, each with 3000 images, 26 for the letters A-Z and 3 for space, delete and
nothing. This data is solely of the user Akash gesturing in ASL, with the images
taken from his laptop’s webcam. These photos were then cropped, rescaled, and
labelled for use.
Examples of images from the Kaggle dataset used for training (note the difficulty of
distinguishing fingers in the letter E). A self-generated test set was created in order
to investigate the neural network's ability to generalize. Five different test sets of
images were taken with a webcam under different lighting conditions, backgrounds,
and use of the dominant/non-dominant hand. These images were then cropped and
preprocessed. Sign language involves many features which are based around the hands;
in general there are hand shape/orientation (pose) and movement trajectories, which
are similar in principle to gestures. A survey of gesture recognition (GR) was performed by Mitra and
Acharya [70], giving an overview of the field as it stood in 2007. While many GR
techniques are applicable, sign language offers a more complex challenge than the
traditionally more confined domain of gesture recognition.
Data Preprocessing
The data preprocessing was done using the Pillow image processing library and the
scikit-learn decomposition module, which is useful for its matrix optimization and
decomposition functionality.
Image Whitening: ZCA, or image whitening, is a technique that uses the singular
value decomposition of a matrix. This algorithm decorrelates the data and removes
the redundant, or obvious, information from the data. This allows the neural
network to look for more complex and sophisticated relationships and to uncover
the underlying structure of the patterns it is being trained on. The covariance matrix
of the image is set to identity, and the mean to zero.
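A compact NumPy sketch of ZCA whitening as described (zero mean, SVD of the covariance matrix, covariance mapped to the identity) is given below; the small epsilon is a standard regularisation term assumed here, and the approach is practical only for modestly sized, flattened images:

import numpy as np

def zca_whiten(X, eps=1e-5):
    # X has shape (n_samples, n_pixels): one flattened image per row
    X = X - X.mean(axis=0)                 # zero-center each pixel
    cov = np.cov(X, rowvar=False)          # covariance matrix of the data
    U, S, _ = np.linalg.svd(cov)           # singular value decomposition
    # rotate into the eigenbasis, rescale so the covariance becomes identity,
    # then rotate back (the "zero-phase" part of ZCA)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return X @ W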
Design of Recognition
Implementation Model
The model used in this classification task is a fairly basic implementation of a
Convolutional Neural Network (CNN). As the project requires classification of
images, a CNN is the go-to architecture. The basis for our model design came from
the paper "Using Deep Convolutional Networks for Gesture Recognition in American Sign
Language", which accomplished a similar ASL gesture classification task [4].
This model consists of convolutional blocks containing two 2D convolutional
layers with ReLU activation, followed by max pooling and dropout layers. These
convolutional blocks are repeated 3 times and followed by fully connected layers
that eventually classify into the required categories. The kernel sizes are maintained
at 3x3 throughout the model. Our originally proposed model is identical to the one
from the aforementioned paper; this model is shown in Figure 5. We omitted the
dropout layers on the fully connected layers at first to allow for faster training and
to establish a baseline without dropout.
We also decided to design a separate model to compare with the model in the paper.
This model was designed to be trained faster and to establish a baseline for problem
complexity. This smaller model was built with only one “block” of convolutional
layers consisting of two convolutional layers with variable kernel sizes progressing
from 5 X 5 to 10 X 10, ReLU activation, and the usual Max Pooling and Dropout.
This fed into three fully connected layers which output into the 29 classes of letters.
The variation of the kernel sizes was motivated by our dataset including the
background, whereas the paper preprocessed their data to remove the background.
The design followed the thinking that the first layer with smaller kernel would
capture smaller features such as hand outline, finger edges and shadows. The larger
kernel hopefully captures combinations of the smaller features like finger crossing,
angles, hand location, etc.
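A hedged PyTorch sketch of this smaller model is shown below; the two kernel sizes, the single convolutional block, the dropout and the three fully connected layers follow the description above, while the channel counts, hidden widths and input resolution are assumptions since the report does not state them:

import torch
from torch import nn

class SmallASLNet(nn.Module):
    # one block of two convolutional layers (5x5 then 10x10 kernels), ReLU,
    # max pooling and dropout, followed by three fully connected layers
    def __init__(self, num_classes=29, in_size=224):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=10, padding=5), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(0.25),
        )
        # infer the flattened feature size for the chosen (assumed) input resolution
        with torch.no_grad():
            n_feat = self.features(torch.zeros(1, 3, in_size, in_size)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_feat, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))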
Model Performance
Our models were trained using the Adam optimizer and cross-entropy loss. The Adam
optimizer is known for converging quickly in comparison with Stochastic Gradient
Descent (SGD), even when SGD uses momentum. However, initially Adam would not
decrease our loss, so we abandoned it in favour of SGD. Debugging the Adam optimizer
after our final presentation taught us that lowering the learning rate significantly can
help Adam converge during training, which allowed us to train more models
towards the end of our project. Of our two models, the one based on the paper was
shown not to be viable, as it took much longer to train without showing any
significant improvement in accuracy or loss. We believe this is likely due to the more
difficult nature of our classification task, with the inclusion of background in the images
and the lower resolution making training more difficult. Thus, we decided to
focus on improving our smaller model, which initially trained to 40% validation
accuracy. Although we had a very large dataset to work with (3,000 samples for
each of 29 classes), after processing the images into NumPy arrays, we found our
personal computers could load a maximum of 50-100 samples per class and our Google
Cloud server could load 200 samples per class. The need to load small datasets actually
led us to test the effect of increasing the data available to our models. Training our
initial models on less data led to the models quickly overfitting, as shown in Figure
7. This is likely due to the small number of samples to train on, leading to poor
generalization and learning of the sample space. Increasing the size of our dataset to
200 samples per class led to better model results, with a peak validation accuracy of
60.3% in epoch 17. However, looking at our loss function, we see that the
validation loss is increasing, indicating overfitting of the model. After we applied
filtering, enhancement, and ZCA whitening to our dataset, the model performance
increased drastically, as shown in Figure 9. The peak validation accuracy achieved is
77.25% in epoch 24. As shown by the plot of loss, the validation loss is still
decreasing, albeit at a slower rate than the training loss, indicating that the model is
not drastically overfitting. This shows that preprocessing our images by applying
filters and ZCA whitening helps to enhance relevant features for the model to learn.
To determine whether our preprocessing of images actually results in a more robust
model, we verified on a test set comprised of images from the original dataset and
our own collected image data. Comparing the performance of the models on this test
set, we see that the model trained on preprocessed images performs much better than the
model trained on the original images, likely due to the former's lack of overfitting.
Work in the field of sign language linguistics has informed the features used for
detection. This is clearly shown in work which classifies in two stages, using first a
sign sub-unit layer, followed by a sign-level layer. This offers SLR the same
advantages as it offered speech recognition, namely a scalable approach to large
vocabularies as well as a more robust solution for time variations between
examples. The early work of Vogler and Metaxas [99] borrowed heavily from the
studies of sign language by Liddell and Johnson [64], splitting signs into motion and
pause sections, while their later work [101] used PaHMMs on both hand shape and
motion sub-units, as proposed by the linguist Stokoe [95]. Work has also
concentrated on learning signs from low numbers of examples. Lichtenauer et al.
[63] presented a method to automatically construct a sign language classifier for a
previously unseen sign. Their method works by collating features for signs from
many people, then comparing the features of the new sign to that set. They then
construct a new classification model for the target sign. This relies on a large
training set for the base features (120 signs by 75 people) yet subsequently allows a
new sign classifier to be trained using one-shot learning. Bowden et al. [12] also
presented a sign language recognition system capable of correctly classifying new
signs given a single training example. Their approach used a two-stage classifier bank,
the first of which used hardcoded classifiers to detect hand shape, arrangement,
motion and position sub-units. The second stage removed noise from the 34-bit
feature vector (from stage 1) using ICA, before applying temporal dynamics to
classify the sign. Results are very high given the low number of training examples
and absence of grammar. Kadir et al. [53] extended this work with head and hand
detection based on boosting (cascaded weak classifiers), a body-centered
description (normalises movements into a 2D space) and then a two-stage classifier
where the stage 1 classifier generates a linguistic feature vector and the stage 2 classifier
uses Viterbi decoding on a Markov chain for the highest recognition probability. Cooper and
Bowden [18] continued this work still further
with an approach to SLR that does not require tracking. Instead, a bank of
classifiers is used to detect phoneme parts of sign activity by training and
classifying (an AdaBoost cascade) on certain sign sub-units. These were then
combined into a second-stage word-level classifier by applying a first-order Markov
assumption. The results showed that the detection rates achieved with a large
lexicon and few training examples were almost equivalent to a tracking-based
approach. Alternative methods have looked at data-driven approaches to defining
sub-units. Yin et al. [110] used an accelerometer glove to gather information about a
sign, before applying discriminative feature extraction and similar state-tying
algorithms to decide sub-unit-level segmentation of the data. Kong et al. [58] and
Han et al. [41] have looked at automatic segmentation of the motions of sign into
sub-units, using discontinuities in the trajectory and acceleration to indicate where
segments begin and end; these are then clustered into a code book of possible
exemplar trajectories using either DTW distance measures, in the case of Han et
al., or PCA features, by Kong et al.
The depth image functionality of the Kinect camera provides us the ability to
extract our region of interest, which in our case is the hand gesture, from the rest of
the background. However, given the complexity of the ISL hand gestures, the
information present in the full frame image is required in order to differentiate the
gestures better.
As described earlier, the depth camera returns the distance of each point in its field
of view from the camera. Now using this as a parameter, we can set a threshold,
within which the gesture can be shown. Fig. 2 depicts a segmented depth image
obtained using a pre-assigned distance threshold.
Since RGB image segmentation cannot be performed in any region depending upon
the distance, we can use the segmented depth image as a mask and multiply it over
the full-frame image to obtain the segmented RGB image. However, a major
problem arises here. According to the properties of the Kinect v1 camera, the field
of view of the depth camera does not coincide with that of the RGB camera, and
hence, due to this constraint, there would be a pixel mismatch if we tried to perform
direct binary masking.
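The depth-threshold-and-mask step described above can be sketched with OpenCV and NumPy as follows (the threshold values are illustrative, and the sketch assumes the depth and RGB frames are already pixel-aligned, which is exactly the problem discussed next):

import cv2
import numpy as np

def segment_hand(rgb, depth_mm, near=400, far=800):
    # keep only pixels whose depth lies inside the pre-assigned range (millimetres)
    mask = ((depth_mm > near) & (depth_mm < far)).astype(np.uint8) * 255
    # use the binary depth mask to cut the hand region out of the RGB frame
    return cv2.bitwise_and(rgb, rgb, mask=mask)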
From this it becomes quite clear that certain advanced computer vision techniques
must be used to calibrate the pixel distribution mapping from depth image to RGB
image in order to obtain one-to-one correspondence between them. Using the
calibration information provided on the website [10], the following parameters are
required for calibration:
fx_rgb, fy_rgb, cx_rgb and cy_rgb denote the intrinsics of the Kinect RGB camera, and
fx_d, fy_d, cx_d and cy_d denote the intrinsics of the Kinect depth camera.
depth (in meters) = 1.0 / (raw_depth * -0.0030711016 + 3.3309495161) --- (i)
Now each pixel in depth map is projected into a 3D metric space using the
equations (ii), (iii) and (iv)
Now we re-project each 3D point on the color image and get its color using the
equations (v),(vi) and (vii)
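Equations (ii)-(vii) were not reproduced in this copy of the report; based on the standard Kinect calibration formulation described on the cited website [10], they take approximately the following form, where R and T denote the rotation and translation between the depth and RGB cameras:

P3D.x = (x_d - cx_d) * depth(x_d, y_d) / fx_d --- (ii)
P3D.y = (y_d - cy_d) * depth(x_d, y_d) / fy_d --- (iii)
P3D.z = depth(x_d, y_d) --- (iv)
P3D' = R * P3D + T --- (v)
x_rgb = (P3D'.x * fx_rgb / P3D'.z) + cx_rgb --- (vi)
y_rgb = (P3D'.y * fy_rgb / P3D'.z) + cy_rgb --- (vii)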
To overcome the above problem, we modified the Python wrapper itself in
the Kinect SDK to align the fields of view of both the depth and RGB maps. This proved
to be the most effective solution to this underlying problem and accurate results
were obtained with no time overheads. Fig 4 depicts the images obtained after
making changes in the python wrapper.
Classification Methods
The earliest work on SLR applied NNs. However, given the success enjoyed
by HMMs in the field of speech recognition, and the similarity between the problems of
speech recognition and SLR, HMM-based classification has dominated SLR since the
mid-90's. Murakami and Taguchi [73] published one of the first papers on SLR.
Their idea was to train a NN given the features from their dataglove and recognise
isolated signs, which worked even in the person-independent context. Their system
failed to address segmentation of the signs in time and is trained at a sign level,
meaning that it is not extendible to continuous recognition. Kim et al. [56] used
datagloves to provide x, y, z coordinates as well as angles, from which they trained a
Fuzzy Min-Max NN to recognise 25 isolated gestures with a success rate of 85%.
Lee et al. [61] used a Fuzzy Min-Max NN to recognise the phonemes of continuous
Korean Sign Language (KSL) with a vocabulary of 131 words, as well as
fingerspelling, without modeling a grammar. Waldron and Kim [103] presented an
isolated SLR system using NNs. They trained a first-layer NN for each of the four
subunit types present in the manual part of ASL, and then combined the results of
the first layer in a second-layer NN that actually recognises the isolated words.
Huang et al. [48] presented a simple isolated sign recognition system using a
Hopfield NN. Yamaguchi recognised a very small number of words using
associative memory (similar to NNs). Yang et al. [109] presented a general method
to extract motion trajectories, and then used them within a Time Delay Neural
Network (TDNN) to recognise ASL. Motion segmentation is performed, and then
regions of interest were selected using colour and geometry cues. The affine
transforms associated with these motion trajectories were concatenated and used to
drive the TDNN, which classifies accurately and robustly. They demonstrated
experimentally that this method achieved convincing results. HMMs are a technique
particularly well suited to the problem of SLR. The temporal aspect of SLR is
simplified because it is dealt with automatically by HMMs [83]. The seminal work
of Starner and Pentland [91] demonstrated that HMMs present a strong technique for
recognising sign language, and Grobel and Assan [36] presented an HMM-based
isolated sign (gesture) recognition system that performed well given the restrictions
that it applied. Vogler and Metaxas [99] show that word-level HMMs are suitable for SLR,
provided that movement epenthesis is also taken into consideration.
They showed how different HMM topologies (context dependent vs. modeling
transient movements) yield different results, with explicit modeling of the epenthesis
yielding better results, and even more so when a statistical language model is
introduced to aid classification in the presence of ambiguity and co-articulation. Due
to the relative disadvantages of HMMs (poor performance when training data is
insufficient, no method to weight features dynamically, and violations of the
stochastic independence assumptions), they coupled the HMM recogniser with
motion analysis using computer vision techniques to improve combined
recognition rates [100]. In their following work, Vogler and Metaxas [101]
demonstrated that Parallel Hidden Markov Models (PaHMMs) are superior to
regular HMMs, Factorial HMMs and Coupled HMMs for the recognition of sign
language due to the intrinsic parallel nature of the phonemes. The major problem,
though, is that regular HMMs are simply not scalable in terms of handling the
parallel nature of phonemes present in sign. PaHMMs are presented as a solution
to this problem by modelling parallel processes independently and combining output
probabilities afterwards. Kim et al. [55] presented a KSL recognition system capable
of recognising 5 sentences from a monocular camera input without a restricted
grammar. They made use of a Deterministic Finite Automaton (DFA) to model the
movement stroke back to rest (to remove the epenthesis), and recognise with a
DFA. Liang and Ouh-Young [62] presented a sign language recognition system that
used data captured from a single DataGlove. A feature vector was constructed that
comprised posture, position, orientation, and motion. Three different HMMs were
trained, and these are combined using a weighted sum of the highest probabilities to
generate an overall score. Results were good on constrained data, but the method is
unlikely to generalise to real-world applications. Kadous [54] presented a sign
language recognition system that used instance-based learning with k-Nearest Neighbours
(KNN) and decision tree learning to classify isolated signs using dataglove features.
The results were not as high as those of NN or HMM-based systems, which, given
the relatively simple nature of the task, suggests that recognition using instance-based
learning such as KNN may not be a suitable approach. Fang et al. [28] used a
cascaded classifier that classified progressively one or two hands, hand shape and
finally used a Self-Organizing Feature Map (SOFM)/HMM to classify the words.
The novelty of their approach was to allow multiple paths in the cascaded classifier
to be taken, allowing for 'fuzziness'. Their approach was fast and robust, delivering
very good classification results over a large lexicon, but it is ill-suited to a real-life
application. Other classifiers are suitable when using alternative inputs, such as that of
Wong and Cipolla [105], who used a limited data set of only 10 basic gestures and
required relatively large training sets to train their relevance vector machine (RVM).
It should also be noted that their RVM requires significantly more training time than
other vector machines, but in return gives a faster classifier which generalises better.
As explained earlier, ISL hand gestures are complex and traditional feature
extraction algorithms perform poorly. For example, the Canny edge detection
algorithm fails due to the usage of both hands, where edges of one hand can get
overlapped or nullified by the other hand.
Target Output: The target of this effort is to construct a system which can classify
particular hand gestures and output the corresponding letters.
Collected Data
Fusing Multi-Modal Sign Data
From the review of SLR by Ong and Ranganath [82], one of their main observations
is the lack of attention that non-manual features have received in the
literature. This is still the case several years on. Much of the information in a sign is
conveyed through this channel, and in particular there are signs that are identical in
respect of the manual features and only distinguishable by the non-manual features
accompanying the sign. The difficulty is identifying exactly which elements are
important to the sign, and which elements are coincidental. For example, does the
blink of the signer convey information valuable to the sign, or was the signer simply
blinking? This problem of identifying the parts of the sign that contain information
relevant to the understanding of the sign makes SLR a complex problem to solve.
Non-manual features can broadly be divided into facial features, which may consist
of lip movement, eye gaze and facial expression, and body posture, e.g. moving
the upper body forward to refer to the future, or sideways to demonstrate the change
of the subject in a dialogue. While, as described in section 3.3, there has been some
work towards the facial features, very little work has been done in the literature
regarding the role of body posture in SLR. The next step in the puzzle is how to
combine the information from the manual and non-manual streams. Von Agris et al.
[2] attempted to quantify the significance of non-manual features in SLR, finding
that the overall recognition rate was improved by including non-manual features in
the recognition process. They merged manual features with (facial) non-manual
features that are modelled using an Active Appearance Model (AAM). After
showing how features are extracted from the AAM, they presented results of both
continuous and isolated sign recognition using manual features and non-manual
features. Results showed that some signs of Deutsche Gebärdensprache/German
Sign Language (DGS) can be recognised based on non-manual features alone, but
generally the recognition rate increases by between 1.5% and 6% upon inclusion of
non-manual features. In [3], Von Agris et al. present a comprehensive sign
language recognition system using images from a single camera. The system was
developed to use manual and non-manual features in a PaHMM to recognise signs,
and furthermore, statistical language modelling is applied and compared. Aran et al.
[5] compared various methods to integrate manual features and non-manual features
in a sign language recognition system. Fundamentally, they have identified a two-
step classification process, whereby the first step involves classifying based on
manual signs. When there was ambiguity, they introduced a second-stage classifier
to use non-manual signs to resolve the problem. While this might appear a viable
approach, it is not clear from sign language linguistics that it is scalable to the full
SLR problem.
Using Linguistics
The task of recognition is often simplified by forcing the possible word sequence to
conform to a grammar, which limits the potential choices and thereby improves
recognition rates [91,104,12,45]. N-gram grammars are often used to improve
recognition rates, most often bi-gram [83,44,34] but also uni-gram [10]. Bungeroth
and Ney [16] demonstrated that statistical sign language translation using Bayes'
rule is possible and has the potential to be developed into a real-world
translation tool. Bauer et al. [11] presented a sign language translation system
consisting of an SLR module which fed a translation module. Recognition was done
with word-level HMMs (high accuracy rate, but not scalable), and the translation was
done using statistical grammars developed from the data.
Generalising to More Complex Corpora
Due to the lack of adequately labelled data sets, research has
turned to weakly supervised approaches. Several groups have presented work
aligning subtitles with signed TV broadcasts. Farhadi and Forsyth [29] used HMMs
with both static and dynamic features to find estimates of the start and end of a sign,
before building a discriminative word model to perform word spotting on 31
different words over an 80,000-frame children's film. Buehler et al. [15] used 10.5
hours of TV data, showing detailed results for 41 signs with full
ground truth, alongside more generic results for a larger 210-word list. They achieve
this by creating a distance metric for signs, based on the hand trajectory, shape and
orientation, and performing a brute-force search. Cooper and Bowden [20] used hand
and head positions in combination with data mining to extract 23 signs from a 30-
minute TV broadcast. By adapting the mining to create a temporally constrained
implementation, they introduced a viable alternative to the brute-force search. Stein
et al. [92] are collating a series of weather broadcasts in DGS and German. This data
set will also contain the DGS glosses, which will enable users to better quantify the
results of weakly supervised approaches.
Python Modules:
The modules used to build this project are as follows:
• OpenCV
• Cvzone
• NumPy
• TensorFlow
• Tkinter
• OS
OpenCV
OpenCV is a huge open-source library for computer vision, machine
learning, and image processing, and it now plays a major role in the real-time
operation that is very important in today's systems. By using it, one can process
images and videos to identify objects, faces, or even human handwriting.
When it is integrated with libraries such as NumPy, Python is capable of
processing the OpenCV array structure for analysis. To identify an image pattern and
its various features, we use vector spaces and perform mathematical operations on
these features.
The first OpenCV version was 1.0. OpenCV is released under a BSD license and
hence it’s free for both academic and commercial use. It has C++, C, Python and
Java interfaces and supports Windows, Linux, Mac OS, iOS and Android. When
OpenCV was designed, the main focus was real-time applications and
computational efficiency. All of its core routines are written in optimized C/C++ to take
advantage of multi-core processing.
Applications of OpenCV:
There are many applications which can be solved using OpenCV; one of them is
face recognition.
OpenCV Functionality
• Image/video I/O, processing, display (core, imgproc, highgui)
• Object/feature detection (objdetect, features2d, nonfree)
• Geometry-based monocular or stereo computer vision (calib3d, stitching,
videostab)
• Computational photography (photo, video, superres)
• Machine learning & clustering (ml, flann)
• CUDA acceleration (gpu)
• Processing of common image and video formats (.mp4, .jpg, .jpeg, .png, etc.)
Computer Vision
Computer vision is the process by which we can understand how images and videos
are stored, and how we can manipulate and retrieve data from them. Computer
vision forms a base of, and is widely used in, artificial intelligence. Computer
vision is playing a major role in self-driving cars and robotics, as well as in photo
correction apps.
CVZone
CVZone is a computer vision package that makes it easy to run face detection,
hand tracking, pose estimation, etc., as well as image processing and other AI
functions. At its core, it uses the OpenCV and MediaPipe libraries.
In many cases, the background of the video needs to be modified, either because there
are distracting interruptions in the background or because the background colour does
not suit the person. So, we use a real-time background replacement technique
to substitute the backgrounds and replace them with the desired content.
Image clipping path – This technique is used if the subject of the image has sharp
edges. All those elements that fall outside the path will be eliminated.
Image cut-out – Here we cut the required region or subject in a frame and remove
the background.
Image masking – If the images have frills or fine edges we can use image masking
techniques.
Erasing the background – Erasing the background of an image using different tools.
import cv2
from cvzone.HandTrackingModule import HandDetector  # cvzone's hand-tracking helper

cap = cv2.VideoCapture(0)            # open the default webcam
detector = HandDetector(maxHands=1)  # track at most one hand
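A minimal capture loop built on these two objects might look like the sketch below (the findHands return signature shown matches recent cvzone releases; older versions return only the annotated image, so this may need adjusting):

while True:
    success, img = cap.read()             # grab a frame from the webcam
    if not success:
        break
    hands, img = detector.findHands(img)  # detect the hand and draw its landmarks
    if hands:
        x, y, w, h = hands[0]["bbox"]     # bounding box that could be used to crop the gesture
    cv2.imshow("Sign Language Recognizer", img)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()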
NumPy
NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object and tools for working with these arrays.
Use of NumPy:
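For instance, every frame captured with OpenCV in this project is already a NumPy array, which makes cropping and scaling straightforward (a small illustrative sketch; the zero array stands in for a captured webcam frame):

import numpy as np

frame = np.zeros((480, 640, 3), dtype=np.uint8)     # stand-in for a captured webcam frame
hand_crop = frame[100:300, 200:400]                 # slice out a region of interest
normalised = hand_crop.astype(np.float32) / 255.0   # scale pixel values to [0, 1]
print(normalised.shape, normalised.dtype)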
TensorFlow
TensorFlow is a free and open-source software library for machine learning and
artificial intelligence. It can be used across a range of tasks but has a particular
focus on training and inference of deep neural networks.
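In this project TensorFlow runs the trained gesture classifier. A hedged sketch of loading a saved Keras model and classifying one preprocessed frame is shown below (the model file name and the 224x224 input size are assumptions, not taken from the actual project files):

import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("keras_model.h5")   # assumed exported model file
frame = np.zeros((1, 224, 224, 3), dtype=np.float32)   # one preprocessed frame (batch of 1)
probs = model.predict(frame)[0]
print("predicted class index:", int(np.argmax(probs)))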
Tkinter
Python offers multiple options for developing GUI (Graphical User Interface).
Out of all the GUI methods, tkinter is the most commonly used method. It is a
standard Python interface to the Tk GUI toolkit shipped with Python. Python with
tkinter is the fastest and easiest way to create the GUI applications. Tkinter is the
de facto way in Python to create Graphical User interfaces (GUIs) and is
included in all standard Python Distributions. In fact, it's the only framework built
into the Python standard library.
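As an illustration, a minimal Tkinter window that could display the recognized letter might look like this sketch (the window title and label text are placeholders):

import tkinter as tk

root = tk.Tk()
root.title("Sign Language Recognizer")                  # placeholder window title
output = tk.Label(root, text="Recognized: A", font=("Arial", 32))
output.pack(padx=20, pady=20)
root.mainloop()                                         # start the GUI event loop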
OS
The OS module in Python provides functions for creating and removing a
directory (folder), fetching its contents, and changing and identifying the current
directory. You first need to import the os module to interact with the underlying
operating system, so import it using the import os statement before using its functions.
Uses:
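For example, a few of the os functions this kind of project can rely on are sketched below (the Data/A directory name is only a placeholder):

import os

print(os.getcwd())                      # identify the current working directory
os.makedirs("Data/A", exist_ok=True)    # create a folder, e.g. for captured images of one letter
print(os.listdir("Data"))               # fetch the contents of the Data directory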
Result:
• The proposed system is demonstrated using the 26 alphabet gestures.
• Moreover, many existing methods are limited by their failure to create a high-quality
intelligent system for recognizing one or two hands.
• In this research, a robust system is proposed to recognize and interpret the
pose and movement of one hand.
Experiments, Results and Analysis
Evaluation Metric
We evaluate two metrics in order to compare our results with
those of other papers. The most popular criterion in the literature is accuracy on the
validation set, i.e. the percentage of correctly classified examples. One other
popular metric is top-5 accuracy, which is the percentage of classifications where
the correct label appears in the 5 classes with the highest scores. Additionally, we
use a confusion matrix, which is a specific table layout that allows visualization of
the performance of the classification model by class. This allows us to evaluate
which letters are the most misclassified and draw insights for future improvement.
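These metrics can be computed roughly as follows from per-class scores, using NumPy and scikit-learn (a sketch; y_true is assumed to be an array of integer labels and scores an array of shape (n_samples, n_classes)):

import numpy as np
from sklearn.metrics import confusion_matrix

def top_k_accuracy(y_true, scores, k=5):
    # fraction of samples whose true label is among the k highest-scoring classes
    top_k = np.argsort(scores, axis=1)[:, -k:]
    return float(np.mean([y in row for y, row in zip(y_true, top_k)]))

def evaluate(y_true, scores):
    y_pred = scores.argmax(axis=1)
    acc = float(np.mean(y_pred == y_true))     # plain validation accuracy
    cm = confusion_matrix(y_true, y_pred)      # per-class confusion matrix
    return acc, top_k_accuracy(y_true, scores), cm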
Experiments
For each of the experiments below, we trained our model
on letters a-y (excluding j). After some initial testing, we found that using an initial
base learning rate of 1e-6 worked fairly well in fitting the training data - it provided
a steady increase in accuracy and seemed to successfully converge. Once the
improvements in the loss stagnated, we manually stopped the process and decreased
the learning rate in order to try and increase our optimization of our loss function.
We cut our learning rate by factors ranging from 2 to 100. Furthermore, we used the
training routine that performed best with real users on our web application
(‘2_init’). We also built models to only classify letters a-k (excluding j) or a-e to
evaluate if we attained better results with fewer classes.
Layer reinitialization and learning rate multiple increase ('1_init'): We initialized
this model with the pre-trained weights from GoogLeNet trained on ILSVRC
2012. We then reinitialized all the classification layers with Xavier initialization
and increased the learning rate multiple of only this layer in order to help it learn
faster than the rest of the net's pre-trained layers.
Layer reinitialization and learning rate multiple increase ('2_init'): We initialized this
model with the pre-trained weights from GoogLeNet trained on ILSVRC 2012.
We then reinitialized all the classification layers with Xavier initialization and
appropriately adjusted their dimensions to match our number of classes. We
increased the learning rate multiples for the top three layers beneath each
classification head.
1 layer reinitialization, learning rate multiple increase, and increased batch size
('1_init'): We initialized this model with the pre-trained weights from GoogLeNet
trained on ILSVRC 2012. We then reinitialized all the classification layers with
Xavier initialization and increased the learning rate multiple of only this layer in
order to help it learn faster than the rest of the net's pre-trained layers. Finally, we
drastically increased the batch size from 4 to 20.
1 layer reinitialization, uniform learning rate ('full_lr'): We initialized this model
with the pre-trained weights from GoogLeNet trained on ILSVRC 2012. We
then reset all the classification layers with Xavier initialization but kept the learning
rate constant throughout the net.
Real-time user testing
As aforementioned, we initially tested our four a – y classification models on real
users through our web application. This consisted of testing on images in a wide
range of environments and hands. A key observation was that there is no significant
correlation between the final validation accuracies of the models and their real-time
performance on our web application.
For example, ‘full_lr’ classified an input as a in over half the cases with > 0.99
probability, almost independently of the letter we showed it. It was very apparent
that the ‘2_init’ model outshined the rest even though it produced the lowest
validation accuracy. However, it still failed to consistently predict the correct letter
in the top-5. For this reason, we created the models alluded to above with five (a –
e) and ten (a – k, excluding j) classes. Testing the classifiers with fewer classes on
the live web application also yielded significantly different results from testing on
the validation set. On ten-letter classification, some letters that were classified
correctly in > 70% of cases (e.g. b, c, Fig. 7) were almost always absent in the top-5
predictions. Moreover, some letters were noticeably overrepresented in the
predictions, like a and e (Fig. 7). We attribute this to the fact that neither of these
has any fingers protruding from the center of the hand. They will thus share the
central mass of the hand in the pictures with every other letter without having a
contour that would exclude them.
Tracking Based
Tracking the hands is a non-trivial task since, in a standard sign
language conversation, the hands move extremely quickly and are often subject to
motion blur. Hands are deformable objects, changing posture as well as position.
They occlude each other and the face, making skin-segmented approaches more
complex. In addition, as the hands interact with each other, tracking can be lost, or
the hands confused. In early work, the segmentation task was simplified
considerably by requiring the subjects to wear coloured gloves. Usually these gloves
were single-coloured, one for each hand [53]. In some cases, the gloves used were
designed so that the hand pose could be better detected, employing coloured markers
such as Holden and Owens [46] or different coloured fingers [44]. Zhang et al. [114]
made use of multicoloured gloves (where the fingers and palms of the hands were
different colours) and used the hands' geometry to detect both position and shape.
Using coloured gloves reduces the encumbrance to the signer but does not remove it
completely.
A more natural, realistic approach is without gloves; the most common detection
approach uses a skin colour model [49,7], where a common restriction is long
sleeves. Skin colour detection is also used to identify the face position, such as in
[109]. Often this task is further simplified by restricting the background to a
specific colour (chroma keying) [48] or at the very least keeping it uncluttered and
static [90]. Zieren and Kraiss [116] explicitly modelled the background, which aids
the foreground segmentation task. Depth can be used to allow simplification of the
problem. Hong et al. [47] and Grzeszcuk et al. [37] used a stereo camera pair from
which they generated depth images, which were combined with other cues to build
models of the person(s) in the image.
Fujimura and Liu [32] and Hadfield and Bowden [38] segmented hands on the naive
assumption that hands will be the closest objects to the camera. It is possible to base
a tracker solely on skin colour, as shown by Imagawa et al. [49], who skin-segmented
the head and hands before applying a Kalman filter during tracking. Han et al. [40]
also showed that the Kalman filter enabled the skin-segmented tracking to be robust
to occlusions between the head and hands, while Holden et al. [45] considered snake
tracking as a way of disambiguating the head from the hands.
They initialised each snake as an ellipse from the hand position on the previous
frame, using a gradient-based optical flow method, and shifted the ellipse to the new
object position, fitting from that point. This sort of tracker tends to be non-robust to
cluttered or moving backgrounds and can be confused by signers wearing short-
sleeved clothes. Akyol and Alvarado [4] improved on the original colour-based skin-
segmented tracker by using a combination of skin segmentation and motion history
images (MHIs) to find the hands for tracking. Awad presented a face and hand
tracking system that combined skin segmentation, frame differencing (motion) and
predicted position (from a Kalman filter) in a probabilistic manner.
These reduced the confusion with static background images but continued to suffer
problems associated with bare forearms. Micilotta and Bowden [68] proposed an
alternative to colour segmentation, detecting the component parts of the body using
Ong and Bowden's detector [80] and using these to infer a model of the current
body posture, allowing the hand positions to be tracked across a video sequence.
Buehler et al. implemented a robust tracker, which used labelled data to initialise colour
models, a head/torso detector and Histogram of Oriented Gradients (HOG) pictorial
descriptors. It used the distinctive frames in a sequence in much the same way that
key frames are used in video encoding: they constrain adjacent frames, and as such
several passes can be made before the final trajectory is extracted.
An alternative to this is the solution proposed by Zieren and Kraiss, who tracked
multiple hypotheses via body modelling, disambiguating between these hypotheses
at the sign level. These backward/forward methods for determining the hand
trajectories offer accurate results, but at the cost of processing time. Maintaining a
trajectory after the hands have interacted also poses a problem. Shamaie and
Sutherland tracked bi-manual gestures using a skin-segmentation-based hand tracker,
which calculated bounding box velocities to aid tracking after occlusion or contact.
While adaptable to real-time use, it suffers from the same problems as other colour-
only based approaches. Dreuw et al. used dynamic programming to determine the
path of the head and hands along a whole video sequence, avoiding such failures at
the local level [24] but negating the possibility of real-time application.
Non-Tracking Based
Since the task of hand tracking for sign language is a non-trivial problem, there
has been work where signs are detected globally rather than tracked and
classified. Wong and Cipolla [105] used Principal Component Analysis
(PCA) on motion gradient images of a sequence, obtaining features for a Bayesian
classifier. Zahedi et al. investigated several types of appearance-based features.
They started by using combinations of down-sampled original images, multiplied by
binary skin-intensity images and derivatives. These were computed by applying
Sobel filters [112]. They then combined skin segmentation with five types of
differencing for each frame in a sequence, all down-sampled to obtain features
[113]. Following this, their appearance-based features were combined with the
tracking work of Dreuw et al. [24] and some geometric features in the form of
moments, creating a system which fuses both tracking- and non-tracking-based
approaches [111]. This system is able to achieve 64% accuracy rates on a more
complex subset of the Boston dataset [75], including continuous sign from three
signers. Cooper and Bowden [19] proposed a method for sign language recognition
on a small sign subset that bypasses the need for tracking entirely.
Conclusion
• This project basically deals with the different algorithms and techniques used for
recognizing hand gestures.
• A hand gesture recognition system is considered a more intuitive and
proficient human-computer interaction tool.
• The range of applications includes virtual prototyping, sign language analysis and
medical training.
• Sign language is one of the tools of communication for physically impaired, deaf and
mute people.
• From the above considerations, it is clear that vision-based hand gesture recognition
has made remarkable progress.
Future Scope
• The accuracy of the program can be further improved by using neural networks.
• The application can be further refined to make it more accessible to consumers.
• The whole point of making the solution as a commercially viable product for the
users is to help the impaired community around the world.
• We focused our efforts on optimizing GoogLeNet, but it would be worth exploring
different nets that have also been proven effective at image classification (e.g. a
VGG or a ResNet architecture).
• We believe that the classification task could be made much simpler if there is very
heavy preprocessing done on the images. This would include contrast adjustment,
background subtraction and potentially cropping.
• A more robust approach would be to use another CNN to localize and crop the
hand.
• Language Model Enhancement
• Building a bigram and trigram language model would allow us to handle sentences
instead of individual words.
• Along with this comes a need for better letter segmentation and a more seamless
process for retrieving images from the user at a higher rate.
Future Enhancement:
• Making the application more user-friendly.
• Adding more complex features.
References:
1. Nicoletta Adamo-Villani and Ronnie B. Wilbur. 2015. ASL-Pro: American Sign
Language Animation with Prosodic Elements. In Universal Access in Human-
Computer Interaction. Access to Interaction, Margherita Antona and Constantine
Stephanidis (Eds.). Springer International Publishing, Cham, 307–318.
2. M Ebrahim Al-Ahdal and Md Tahir Nooritawati. 2012. Review in sign language
recognition systems. In 2012 IEEE Symposium on Computers & Informatics
(ISCI). IEEE, 52–57.
3. Sedeeq Al-khazraji, Larwan Berke, Sushant Kafle, Peter Yeung, and Matt
Huenerfauth. 2018. Modeling the Speed and Timing of American Sign Language to
Generate Realistic Animations. In Proceedings of the 20th International ACM
SIGACCESS Conference on Computers and Accessibility. ACM, 259–270.
4. Anwar AlShammari, Asmaa Alsumait, and Maha Faisal. 2018. Building an
Interactive E-Learning Tool for Deaf Children: Interaction Design Process
Framework. In 2018 IEEE Conference on e-Learning, e-Management and e-
Services (IC3e). IEEE, 85–90.
5. UN General Assembly. 2006. Convention on the Rights of Persons with
Disabilities. GA Res 61 (2006), 106. British Deaf Association. 2015. George Veditz
Quote - 1913. (2015). https://vimeo.com/132549587 Accessed 2019-04-22.
6. J Andrew Bangham, SJ Cox, Ralph Elliott, JRW Glauert, Ian Marshall, Sanja
Rankov, and Mark Wells. 2000. Virtual signing: Capture, animation, storage and
transmission-an overview of the visicast project. (2000).
7. H-Dirksen L Bauman. 2004. Audism: Exploring the Metaphysics of Oppression.
Journal of deaf studies and deaf education 9, 2 (2004), 239–246.
8. H-Dirksen L Bauman and Joseph J Murray. 2014. Deaf Gain: Raising the Stakes for
Human Diversity. U of Minnesota Press.
9. Bastien Berret, Annelies Braffort, and others. 2016. Collecting and Analysing a
Motion-Capture Corpus of French Sign Language. In Workshop on the
Representation and Processing of Sign Languages.
10. Claudia Savina Bianchini, Fabrizio Borgia, Paolo Bottoni, and Maria De Marsico.
2012. SWift: a SignWriting improved fast transcriber. In Proceedings of the
International Working Conference on Advanced Visual Interfaces. ACM, 390–393.