Assistive Sign Language Converter for Deaf and Dumb
Abstract— Deaf and dumb people are humans at the deepest psychological level. Many of these people are not even exposed to sign languages, and it is observed that it gives great relief on a psychological level when they find out about signing to connect themselves with others by expressing their love or emotions. About 5% of the world's population suffers from hearing loss. Deaf and dumb people use sign language as their primary means to express their thoughts and ideas to the people around them through different hand and body gestures. There are only about 250 certified sign language interpreters in India for a deaf population of around 7 million. In this work, the design of a prototype of an assistive device for deaf-mute people is presented so as to reduce this communication gap with normal people. The device is portable and can hang over the neck. It allows the person to communicate using sign hand postures, from which the different gesture-based signs are recognized. The controller of this assistive device processes the images of gestures by employing various image-processing techniques and deep learning models to recognize the sign. The recognized sign is converted into speech in real time using a text-to-speech module.

I. INTRODUCTION

According to a recent survey by the WHO (World Health Organization)[1], around 5% of the population in the world, that is, over 460 million individuals worldwide, have a hearing problem, out of which about 12.3 million belong to India. In India, the provision of deaf schools and other deaf committees and organizations is limited. The reason behind this is that only 4.5 million people succeeded in completing their education at deaf schools, as they are not qualified for normal educational institutions. It is reported that only about 478 government-funded schools and approximately 372 private schools for the deaf-mute are available in India. These classrooms use only the “verbal approach”, so there is rarely any school that uses signs to communicate with the deaf students. In rural areas, many deaf people grow up without exposure to sign languages, causing a psychological barrier which reflects in their behaviour and attitude.

Since communication is a part of our psychological make-up, many deaf or dumb people experience communication issues daily, such as no sign language interpreters, unqualified sign language interpreters, lack of understanding from other people, intolerant people, refusal to communicate, lip-reading and attitudes. This situation can be physically draining, and the psychological trauma experienced by them can be life long if there is no relief.

Sign language is not introduced to many people in rural areas and hence they hesitate to interact with others. These problems can be addressed by the government if schools are established with interpreters. Hence, there is a need to address the huge demand for interpreters in various education and working sectors. However, according to Children of Deaf Adults (CODA), there are only about 250 certified sign language interpreters in India to translate for a deaf population of between 1.8 million and 7 million, and these are mostly either unwilling or unqualified to be interpreters. In addition, the rate of unemployment for deaf or dumb persons is approximately 75%, which needs to be addressed by the government. It is required to increase the percentage of educated deaf or dumb people, either by increasing the number of interpreters or by making assistive devices available to replace the interpreters. It is possible to design an efficient portable assistive device by employing technological advancements in computing devices, deep learning algorithms and the Internet of Things.

II. LITERATURE REVIEW

In 1970, a famous commercial glove called the VPL data glove was designed, which consisted of protected optical fibre sensors along the back of the fingers. A glove-environment system developed by Starner and Pentland could identify up to 45 signs from ASL at a speed of 5 Hz. Later, with more experimental research, a SmartGlove was invented which could extract skeletal motion and convert it to the corresponding gesture[2]. The major drawback of these systems was that they were bulky and expensive, so in order to overcome this a vision-based system needs to come into existence. Such a system overcomes the size problem of sensor-based gloves and has higher accuracy in recognizing signs in real time.

Another sensor-based sign language translator provided by Cornell University[3] was able to gather three-dimensional information regarding every movement of the hand or fingers with the help of an accelerometer and various contact and flex sensors. Even this system is bulky and expensive compared to vision-based systems. The system proposed in Real Time Based Bare Hand Gesture Recognition[4] recognizes bare hand gestures in real time with the help of a DVS (dynamic vision sensor) camera for recognizing up to three signs. Though it was implemented in a real-time environment, the applications of this system were limited, such as a mouse-free interface. A method for converting ASL gestures to speech using graphical programming (LabVIEW) was proposed by Keerthi[5].

A Study on Hand Gesture Recognition Technique[6] presented various efficient classification techniques, one of them
Fig. 3: Prototype

Fig. 4: Sign Language Converter
C. Speaker

The speaker gives the audio output of the text recognized by the controller. The controller uses Text-To-Speech to convert the recognized text and pass it on to the speaker.
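The paper does not name the specific text-to-speech engine used on the controller. As a minimal sketch, assuming the offline pyttsx3 library is installed on the Raspberry Pi, the recognized word could be spoken as follows:

# Minimal text-to-speech sketch (assumption: pyttsx3 is available on the controller).
import pyttsx3

def speak(text: str) -> None:
    """Convert the recognized text to speech and play it on the attached speaker."""
    engine = pyttsx3.init()          # initialise the TTS engine (uses eSpeak on Linux)
    engine.setProperty("rate", 150)  # moderate speaking rate
    engine.say(text)                 # queue the recognized word or sentence
    engine.runAndWait()              # block until audio playback finishes

if __name__ == "__main__":
    speak("CALL")  # e.g. a word recognized from the sign sequence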
VI. CONVOLUTIONAL NEURAL NETWORK

A convolutional neural network (CNN) is a class of deep, feed-forward artificial neural networks and has been tested and proved to be one of the most effective networks for analyzing and recognizing images. It offers higher efficiency compared to the other types of neural networks available for multiclass image classification.

For performing classification, a certain pattern, shape and many other features have to be identified. For a computer, images are just an array of pixels with values ranging from 0 to 255, represented as an array of X x Y x 3 (X x Y x 1 if the image is gray-scale). These pixel values have no meaning to us, but they are used by the neural network in the training process to create a model.

The image captured by the camera undergoes pre-processing techniques to separate the hand from the background. This thresholded image, with the hand indicated in white and the background in black, is passed on to the CNN model. The model is trained on a self-made data set of around 32,000 images. The trained model classifies the hand gesture performed in the image and returns the corresponding alphabet.

A. Generating Network Layers

In order to develop a complex architecture for classification, it is required to stack different types of feature-extracting layers together. In general, a neural network involves the following four different types of layers:
• Convolution layer
• Pooling / subsampling layer
• Non-linear layer
• Fully connected layer
The hyper-parameters which define the output volume are:
1) Depth: It indicates the total number of filters used in the layer; each filter learns to look for a different kind of feature in the input image.
2) Stride: It indicates how the filter slides over the image. If the stride is 1, the filter slides over the image pixel by pixel. Similarly, if the stride is 2, the filter slides two pixels at a time.
3) Padding: Along the boundary of the input image, a few zeroes are padded to regulate the size of the output volume.
B. Architecture

The layers used in the CNN architecture developed for the proposed system are shown in Fig. 6. The number of layers depends on the problem specification. A CNN architecture involves a number of layers (typically varying from 8 to 22), where the first few layers comprise convolutional layers, each followed by a pooling layer and an activation (non-linear) layer[10]. The final few layers are fully-connected layers in which each node is connected to every other node. For classification purposes, the final layer is a Dense layer with a softmax activation function, which outputs a probability distribution[11]. The class which has the highest probability among all classes is chosen as the final output prediction. This softmax activation layer is differentiable and resilient to outliers[12]. The first layer is a conv2D layer which uses the sigmoid activation function. This layer takes the image pixels as input and outputs marginal probabilities which are passed on to the next layer.

Fig. 6: CNN Architecture
1) Convolutional layer: This layer is designed to use a 5x5 filter with a depth equal to the input image depth for the proposed system. This filter slides over the input image and gets convolved with that receptive field. The size of the region to be scanned is given by a parameter called the filter size. The size of the output generated by this layer can be computed as

Output size = (Input size - Filter size + 2 * Padding) / Stride + 1
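As an illustration of this formula (a hedged sketch; the 28x28 input size and zero padding below are assumed values, not taken from the paper), the output width of a convolution can be computed as:

# Worked example of the convolution output-size formula.
# Assumption: a 28x28 input, the paper's 5x5 filter, stride 1 and no padding.
def conv_output_size(input_size: int, filter_size: int, padding: int, stride: int) -> int:
    return (input_size - filter_size + 2 * padding) // stride + 1

print(conv_output_size(input_size=28, filter_size=5, padding=0, stride=1))  # -> 24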
2) Pooling Layer: The Pooling layer uses a 2x2 filter to down-sample the output obtained from the convolutional layer. In this layer, the output size of the convolutional layer is reduced to avoid the over-fitting problem which may occur due to a small data-set. There are several non-linear functions to implement pooling, such as max, min and average. Max pooling extracts the highest value in the given pool size, whereas average pooling takes the average of all the values in that pool size and min pooling extracts the lowest value in the given pool size. In this model, max pooling is chosen, which selects the maximum pixel among the ones within the filter. This layer, with strides of 2x2, is able to cut down the number of parameters to learn and provides basic translation invariance to the internal representation. This effectively decreases the number of weights by 75%, lessens the time taken for training the model and reduces the computational cost. The output size of this pooling layer can be computed as

Output size = (Input size - Pool size + 2 * Padding) / Stride + 1
3) Activation layer: This layer follows the convolutional layer, and its main objective is to introduce non-linearity into the output of the convolutional layer, which computes only linear operations.
4) Fully Connected layer: In this layer each neuron has a connection with every other node (a sketch of stacking these layers is given below).
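The paper does not list the exact filter counts or dense-layer widths, so the following is only a hedged sketch of how such a stack could be assembled with Keras, using the 5x5 convolution filters, 2x2 max pooling and softmax output described above; the number of filters (32/64), the 64x64 input size, the dense width of 128 and the class count of 30 are illustrative assumptions.

# Sketch of a CNN of the kind described above (TensorFlow/Keras assumed available).
# Filter counts, input size and dense width are illustrative, not the paper's exact values.
from tensorflow.keras import layers, models

num_classes = 30  # assumption: 26 alphabets plus a few common words

model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),                                   # thresholded (gray-scale) hand image
    layers.Conv2D(32, (5, 5), padding="same", activation="sigmoid"),   # first conv2D layer (paper uses sigmoid here)
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),             # 2x2 max pooling with stride 2
    layers.Conv2D(64, (5, 5), padding="same", activation="relu"),      # further feature-extracting layers
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),                              # fully connected layer
    layers.Dense(num_classes, activation="softmax"),                   # softmax gives the probability distribution
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])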
C. Dataset

A self-made dataset is created which consists of around 1200 images of each of the 26 alphabets and a few commonly used words like emergency, food, water and need (see the sixth row of Fig. 7). The dataset is obtained from a video sample from which the frames are extracted, processed and stored. The thresholded image of each alphabet and commonly used word is shown in Fig. 7.

Fig. 7: Self-made Dataset

VII. REAL TIME IMPLEMENTATION

In various computer-vision applications, background subtraction is considered a major pre-processing step, for example, a traffic camera where the information regarding the passing vehicles needs to be extracted. In this scenario, the vehicles are tracked after the background is separated from the video. There are two methods available in the literature[13] for removing the background to extract the foreground object. We have employed these methods in the design of the proposed Sign Language Converter for identifying the hand region in the input image captured by the camera, and these methods are explained below (a code sketch follows the list):
• The first method involves capturing an image when the hand is not present so as to obtain a clear background image model. Once the background model is obtained, every frame in the video undergoes background subtraction by subtracting the background model from the frame value. In this way, the background is completely removed and the pixels containing the hand are identified. An OpenCV class called BackgroundSubtractorMOG2 is used, which creates a background model and constantly updates it to accommodate changes in the background.
• For tracking a particular object, the corresponding pixel value is noted and the image is scanned pixel by pixel to detect the required object. The skin HSV model[14] is utilized to differentiate the hand of the user from the background. If a pixel contains the hand, it is indicated by 1 and the rest are marked as 0.
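A hedged sketch of these two pre-processing methods follows, using standard OpenCV calls (cv2.createBackgroundSubtractorMOG2, cv2.cvtColor, cv2.inRange); the HSV skin-colour bounds, the camera index and the choice to combine the two masks are assumptions made for illustration, not values reported in the paper.

# Sketch of the two hand-segmentation methods (OpenCV assumed; HSV bounds are assumed values).
import cv2
import numpy as np

cap = cv2.VideoCapture(0)                                                 # camera attached to the controller
bg_subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)  # method 1: adaptive background model

# Assumed skin-colour range in HSV for method 2 (tune for lighting and skin tone).
SKIN_LOW = np.array([0, 40, 60], dtype=np.uint8)
SKIN_HIGH = np.array([25, 255, 255], dtype=np.uint8)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Method 1: foreground (hand) mask from the continuously updated background model.
    fg_mask = bg_subtractor.apply(frame)
    # Method 2: skin mask from the HSV colour model; hand pixels become 255, the rest 0.
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    skin_mask = cv2.inRange(hsv, SKIN_LOW, SKIN_HIGH)
    # Combine both cues into the white-hand / black-background image fed to the CNN.
    hand = cv2.bitwise_and(fg_mask, skin_mask)
    cv2.imshow("thresholded hand", hand)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()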
VIII. RESULTS AND DISCUSSION

The prototype of the Sign Language Converter presented in this paper is an assistive device and is experimentally found to be portable enough to carry around for day-to-day work. The circuitry of the device is concealed in a transparent case which is small enough to be carried around. The functionality of the sign recognition system developed in this prototype is tested using a number of different test data in order to verify its working for a variety of background and working conditions.

In designing this sign language converter, the controller is implemented using a Raspberry Pi 3 Model B, which supports the deep learning and OpenCV libraries required for the development of the software. The user carrying this prototype communicates with the surrounding people by placing the hand gestures in front of the camera as shown in Fig. 4. The camera captures the image and transfers it to the camera port of the Raspberry Pi. The Raspberry Pi processes the image and identifies the sign, which is further converted into speech. Since the speech is made audible using the speaker, the surrounding people can understand and respond to the sign language used by the (deaf-mute) person carrying this prototype. The performance of the prototype depends on the number of frames captured by the camera per unit time in order to recognize the sign as efficiently as possible. The performance of the image recognition software depends completely on the specification of the convolutional layer and the dataset.
We have created the dataset for training the CNN model used in this Sign Language Converter. The dataset, consisting of around 35000 images with around 1200 images per sign, was used to train the CNN model, based on the fact that the larger the dataset, the higher the efficiency of the model. The CNN model of the proposed system is trained on 80% of the dataset and the remaining 20% of the data is used for cross-validation (see the training sketch after Fig. 9). The model is trained for 20 epochs and the accuracy of the trained model is observed to be 99%, as plotted in Fig. 8. The loss computed after every epoch is noted. The variation of loss with the number of iterations/epochs is plotted in Fig. 9. The overall loss of the model is observed as 0.0348 after all twenty epochs.

Fig. 9: Decreasing loss with every epoch
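A hedged sketch of this training setup follows. The 80/20 split and the 20 epochs are taken from the paper, while the batch size, the array shapes and the stand-in arrays are assumptions; `model` refers to the CNN sketched in Section VI.

# Sketch of the training procedure: 80% training / 20% cross-validation, 20 epochs.
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for the self-made dataset: thresholded hand images and one-hot sign labels.
images = np.zeros((35000, 64, 64, 1), dtype="float32")
labels = np.zeros((35000, 30), dtype="float32")

x_train, x_val, y_train, y_val = train_test_split(
    images, labels, test_size=0.2, random_state=42)   # 80/20 split as in the paper

history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=20,       # the paper trains for 20 epochs
    batch_size=32)   # assumed batch size

# history.history["loss"] and ["val_accuracy"] give the per-epoch curves of Fig. 8 and Fig. 9.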
In this sign language converter, video is captured through the camera module attached to the controller. Background subtraction is then performed on each frame and the thresholded hand image is passed on to the CNN model. The model, which is a multi-class classifier, outputs the most probable sign it recognizes. If the predicted sign is the same for 16 consecutive frames, that sign is stored in memory (sketched below). Fig. 10 shows the word "CALL" recognized by the system corresponding to the signs performed by the user. After a word is stored, it is passed on to the text-to-speech module, which outputs speech through the speaker connected to the controller, as shown in Fig. 11.

The user of this prototype must perform the unique hand gestures as shown in Fig. 1 one after the other to convey the word of his/her interest. This prototype predicts the alphabet for each gesture one by one and a word is formed. This word is then converted into a speech signal (see Fig. 11), making it audible to the others.
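The 16-frame stability rule above can be sketched as a small debouncing routine; this is an illustrative outline, and the per-frame prediction stream standing in for the CNN output is hypothetical, not code from the paper.

# Sketch of the 16-consecutive-frame rule used before a predicted letter is accepted.
STABLE_FRAMES = 16

def stable_letters(per_frame_predictions):
    """Yield a letter once the classifier has returned it for 16 frames in a row."""
    last, count = None, 0
    for letter in per_frame_predictions:
        count = count + 1 if letter == last else 1
        last = letter
        if count == STABLE_FRAMES:
            yield letter            # commit the letter to the word being formed
            last, count = None, 0   # require 16 fresh frames for the next letter

# Example: a stream of per-frame predictions spelling the word "CALL".
word = "".join(stable_letters(["C"] * 16 + ["A"] * 16 + ["L"] * 16 + ["L"] * 16))
print(word)  # -> CALL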
The confusion matrix shown in Fig. 12 is a plot that is used for describing the performance of the CNN model. Each row of the confusion matrix shows the true label, while each column shows the instances of the predicted label. For example, row 3 of this confusion matrix shows that out of 208 images of C used for testing the model, 207 are correctly predicted as C (column 3) and one is incorrectly predicted as E (column 5). This low-cost and small-sized sign language converter can be reprogrammed for different types of signs. The prototype is designed keeping its portability into account. The Raspberry Pi 3 is the heaviest module, with a weight of 45.4 gms with a
Fig. 12: Confusion matrix without normalization
IX. CONCLUSIONS

People with hearing impairment face many problems in their day-to-day life and are unable to lead a normal life. They have to depend on family members or an interpreter for tasks like traveling on public transport or attending school or the workplace, as the essential facilities are not available for them in India. The system designed is a portable real-time sign language translator which can assist them anywhere and help them communicate with normal people at the workplace or any public place. This portable assistive device helps them to take part in various activities and to be treated equally, like any other person. The proposed system takes the input gesture through the camera, translates it into an alphabet and spells the word out after forming the letter sequence into a word. This sign language converter is found to be 99% accurate in recognizing the signs and generating the correct words.
REFERENCES

[1] World Health Organization, Deafness and hearing loss fact sheet, http://www.who.int/mediacentre/factsheets/fs300/en/
[2] Eun-Jung Park, Namita Sehdev, Roger Frogoso, “American Sign Language Translator using Gesture Recognition”, The City College of New York.
[3] Ranjay Krishna, Seonwoo Lee, Si Ping Wang, Jonathan Lang, “The Sign Language Translator - The Sound of Signing”, Cornell University, 2012.
[4] Sana'a Khudayer Jadwaa, “Feature Extraction for Hand Gesture Recognition: A Review”, International Journal of Scientific & Engineering Research, Volume 6, Issue 7, July 2015.
[5] Keerthi S Warrier, Jyateen Kumar Sahu, Himadri Halder, Rajkumar Koradiya, V Karthik Raj, “Software based sign language converter”, IEEE International Conference on Communication and Signal Processing (ICCSP), 2016.
[6] S. K. Yewale, P. K. Bharne, “Hand gesture recognition using different algorithms based on artificial neural network”, International Conference on Emerging Trends in Networks and Computer Communications (ETNCC), 2011.
[7] Sanjay Meena, “A Study on Hand Gesture Recognition Technique”, M.Tech Thesis, National Institute of Technology, Rourkela, Orissa, India, 2011.
[8] Vladimir I. Pavlovic, Rajeev Sharma, Thomas S. Huang, “Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, July 1997.
[9] http://www.deaftravel.co.uk/signprint.php?id=27
[10] Vladislava Bobić, Predrag Tadić, Goran Kvaščev, “Hand gesture recognition using neural network based techniques”, 13th Symposium on Neural Networks and Applications (NEUREL), 2016.
[11] Sagar Sharma, “Activation Functions: Neural Networks - Sigmoid, tanh, Softmax, ReLU, Leaky ReLU Explained”, Towards Data Science, 2017.
[12] K. Hara, K. Nakayama, “Comparison of activation functions in multilayer neural network for pattern classification”, IEEE International Conference on Neural Networks (ICNN'94), 1994.
[13] Kashmera Khedkkar Safaya, J. W. Bakal, “Real Time Based Bare Hand Gesture Recognition”, IPASJ International Journal of Information Technology (IIJIT), Volume 1, Issue 2, July 2013.
[14] S. Kolkur, D. Kalbande, P. Shimpi, C. Bapat, J. Jatakia, “Human Skin Detection Using RGB, HSV and YCbCr Color Models”, Advances in Intelligent Systems Research, 2017.