International Journal of Innovative Science and Research Technology
Volume 4, Issue 3, March 2019
ISSN No: 2456-2165

Sign Language Recognition System with Speech Output
Vinayak S. Kunder, Aakash A. Bhardwaj, Vipul D. Tank
Students, Dept. of Computer Engineering
Pillai HOC College of Engineering and Technology
Rasayani, India

Abstract:- The Indian deaf and mute community is troubled by a lack of capable sign language volunteers, and this lack is reflected in the relative isolation of the community. Many techniques have been used to apply machine learning and computer vision principles to recognize sign language inputs in real time, and CNN is identified as one of the more accurate of these techniques. This project aims to use a CNN to classify and recognize sign language inputs and to produce text-to-speech outputs. In doing so, sign language speakers will be able to communicate with larger audiences without depending on the speed of a human interpreter. OpenCV is used for live detection of the hand gestures performed by the user. We are using Indian Sign Language to train the model. After recognition of a particular gesture, the predicted text is converted to speech so that the other person can hear the message the user wants to convey.

Keywords:- Sign Language, OpenCV, Gestures, Convolutional Neural Networks (CNN).

I. INTRODUCTION

Communication is one of the reasons for the rapid progress that mankind has been making in the fields of art, architecture, music, sports, science, technology, drama and so on. It lies at the root of progress and is the most essential tool for survival as a group and society. Most of us are fortunate enough to have the ability to communicate verbally, while others are not so fortunate, being vocally challenged or hearing impaired. Sign language comprises many complex hand movements, and every tiny posture can have a variety of possible meanings [6]. The way these communities communicate has evolved from pointing at objects to using a specific set of gestures to convey meaning in the form of sign language. People who are vocally challenged or hearing impaired use different forms of sign language all over the world. This helps them receive an education equivalent to that of their peers. While intellectually unaffected by their inability to speak or hear, the sign language community all over the world still functions in relative isolation. Indian Sign Language has about 2.7 million speakers and roughly 300 known volunteering interpreters. It is not feasible for such a low number of interpreters to serve that many speakers. It becomes very challenging for a sign language speaker to interact with someone outside the sign language community, since there is a lack of volunteering interpreters or of any other facilities which can support a conversation between these two individuals. Greater challenges arise when a sign language speaker has to interact with a large audience. Sensor-based methods provide feasible solutions, but wearing extra equipment on the hands is inconvenient for people [7].

Roughly 28 million people in India suffer from some level of hearing loss, according to research by the Centers for Disease Control and Prevention. A small percentage of these people may not use sign language at all, instead using methods like lip reading. A person who communicates using lip reading can participate more actively in real-world conversations. On the other hand, those who only use sign language are limited to conversations with fellow sign language speakers. Therefore, they require interpreters to engage fully in real-world conversations.

II. LITERATURE SURVEY

A. Suharjito, Ricky Anderson, Fanny Wiryana, Meita Chandra Ariesta, Gede Putra Kusuma, "Sign Language Recognition Application Systems for Deaf-Mute People: A Review Based on Input-Process-Output," 2nd International Conference on Computer Science and Computational Intelligence 2017 (ICCSCI 2017), Bali, Indonesia: This paper studies the various phases of our intended application, such as data acquisition, image processing and other suggested classification methods, and their comparison with Convolutional Neural Networks. The classification accuracy reported for CNN is 94.2%, which encouraged us to aim for similar results in our own implementation.
B. Pratibha Pandey, Vinay Jain, "Hand Gesture Recognition for Sign Language: A Review," IJSETR, Vol. 4, Issue 3, March 2015: The objective of this paper is to highlight widely effective methods of capturing gestures which have been fundamental in the recent past.
C. Manisha U. Kakde, Mahender G. Nakrani, Amit M. Rawate, "A Review Paper on Sign Language Recognition for Deaf and Dumb People using Image Processing," IJERT, ISSN: 2278-0181, Vol. 5, Issue 03, March 2016: This paper lists the currently most popular methods of sign acquisition.
D. Shreyasi Narayan Sawant, M. S. Kumbhar, "Real Time Sign Language Recognition using PCA," 2014 IEEE Conference on Advanced Communication Control and Computing Technologies: The results of this paper demonstrate that, using simple hand gestures, the corresponding letters can be predicted and obtained as output.

III. PROPOSED WORK

The input will be a live webcam feed of a person speaking in sign language. The waist-up portion will be within the frame to maximize the variety of signs taken as input. The challenge we face is applying a suitable computer vision method that can detect the smallest discrepancy between similar signs, to avoid inaccurate translation in real time. We also need to train a separate dataset for random gestures and movements to ensure they are not translated as meaningful signs. Next, we need to instantly display captions resulting from the translation of the sign language input on the screen, to give extra assistance to the audience. Since our final goal is to give a sound output for the aforementioned translation, we need to employ suitable text-to-speech software or a web API that can convert this text into sound in the most human way possible, to avoid a robot-like sound output.

A. Sign Language
Humans are social animals who have communication at the core of their characteristics. Languages have helped us evolve as a species. One such language is sign language, which is essentially a means of communication used to convey meaning through gestures and symbols. Around the world there are different versions of sign language depending on the region [5]. ISL (Indian Sign Language) consists of both isolated signs and continuous signs. An isolated sign is a single hand gesture, while a continuous sign is a moving gesture represented as a series of images. Sign language helps its speakers gain knowledge and gives them a real chance at a formal education. In India, the lack of knowledge about sign language has caused speech and hearing impaired people to drop out of the education system rapidly. This is something we aim to change. The fight to make Indian Sign Language official has been encouraging lately, as 'The First Indian Sign Language Dictionary of 3000 Words' was launched in March 2018. We realised that some level of automation is needed in this role to help the deaf and mute gain access to education and further improve their lives.

B. Computer Vision
Computer vision is at the crux of our work. It is a technology that allows computers to perceive the outside world through the lens of a camera and make logical interpretations. Computer vision has evolved rapidly since its applications and potential were noticed shortly after its introduction. Computer-vision-aided machine learning is being used to solve a massive number of real-world problems such as automated driving, surveillance, object detection, facial recognition, gesture detection and so on. Computer vision in health care also has the potential to deliver real value; while computers will not replace healthcare personnel, there is a possibility of improving routine diagnostics that require a lot of time and expertise. In this way computer vision serves as a helping tool for healthcare. Our work is based on acquiring visual input and processing it further. There are a number of libraries available that facilitate the use of computer vision, such as OpenCV, PCL (Point Cloud Library), ROS for robotic vision, and MATLAB. We have used OpenCV for its seamless integration with Python programs. Training and classification are done with the help of neural networks, which have been shown to reach an accuracy of as much as 94.2% in [1]. Neural networks used for computer vision applications require a lot of high-quality data: the algorithms need large amounts of data related to the problem in order to produce good results. Images are available online in large quantities, but the solution to many real-world problems needs quality labelled training data, which makes it expensive because the labelling has to be done manually.

C. Convolutional Neural Networks (CNN)
A Convolutional Neural Network is a type of artificial neural network (ANN) [2] which makes use of convolutional layers for filtering inputs and extracting information. The convolution operation combines the input data, also known as a feature map, with a convolution kernel (filter) to produce a transformed feature map. The filters in convolutional layers are adjusted according to learned parameters so as to extract the most relevant information for a given task. Convolutional neural networks make automatic adjustments to find the most suitable features required by a task: the network would filter information related to the shape of an object when assigned an object recognition task, but would extract the colour of an animal when assigned an animal recognition task. This is based on the ability of the Convolutional Neural Network to understand that different classes of objects have different shapes, while different types of animals are more likely to differ in colour than in shape. Convolutional Neural Networks have various applications, including natural language processing, sentiment analysis, image classification, image recognition, video analysis, speech recognition and text classification, as well as Artificial Intelligence systems such as virtual assistants, robots, autonomous cars, drones, and manufacturing machines. The CNN has been important in our implementation, being used to train our dataset of multiple gestures. A CNN consists of an input layer, an output layer and hidden layers, where the hidden layers include convolutional layers, pooling layers, fully connected layers and normalization layers.
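
As an illustration of the convolution operation described above, the short NumPy sketch below slides a 3x3 kernel over a small grayscale image and sums the element-wise products to build a transformed feature map. The image values and the kernel are made-up illustrative numbers, not data from our gesture dataset.

import numpy as np

# Toy 6x6 grayscale "image" and a 3x3 vertical-edge style kernel (illustrative values).
image = np.arange(36, dtype=np.float32).reshape(6, 6)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=np.float32)

def convolve2d(img, k):
    # Valid cross-correlation: slide the kernel over the image and sum
    # the element-wise products at every position.
    kh, kw = k.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

feature_map = convolve2d(image, kernel)
print(feature_map.shape)  # (4, 4): a smaller, transformed feature map
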
D. Technical Details
The need to find a solution to the lack of sign language volunteers encouraged us to build a system that takes signs as input, analyses them against the dataset, generates a string output for each specific sign, and uses that string to generate a speech output. We have also included functionality that enables users to train their own gestures.

We used Python for the programming, along with OpenCV and the pyttsx3 library for real-time speech generation.


Fig 1

 Data Acquisition
The form of data that needs to be processed, and hence acquired, is image data, which can easily be captured with a webcam as in [2] using OpenCV functions that break live video down into single frames representing individual images. [3] have used a 3D model approach. We have captured our own dataset for each gesture, so that users can set their own gestures for different words and do not have to remember the standard gestures. To enhance the specificity of the acquired data, we can use skeletonization and background reduction techniques. Skeletonization reduces the main subject of the image to a skeleton-like frame joined at points. Background reduction removes redundant data that interferes with the subject and causes inefficiency. These images are converted to numerical form using the numpy library, which facilitates mathematical operations, finally converting each frame into a matrix containing a value for each pixel in the image.
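
A minimal sketch of this acquisition step is shown below, using OpenCV and NumPy. The region-of-interest coordinates, the threshold value and the output folder gesture_data/hello are illustrative assumptions rather than the exact values used in our implementation.

import os
import cv2
import numpy as np

save_dir = "gesture_data/hello"           # hypothetical folder for one gesture
os.makedirs(save_dir, exist_ok=True)

cap = cv2.VideoCapture(0)                 # open the default webcam
count = 0
while count < 200:                        # collect 200 sample frames for this gesture
    ok, frame = cap.read()
    if not ok:
        break
    roi = frame[100:340, 100:420]         # crop a fixed region of interest (example coordinates)
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    # Simple background reduction: threshold away low-intensity background pixels.
    _, mask = cv2.threshold(gray, 120, 255, cv2.THRESH_BINARY)
    sample = cv2.resize(mask, (320, 240)) # store frames at a fixed size (here 240x320 pixels)
    matrix = np.asarray(sample)           # each frame becomes a matrix of pixel values 0-255
    cv2.imwrite(os.path.join(save_dir, f"{count}.png"), sample)
    count += 1
    cv2.imshow("capture", sample)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
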
 Training the CNN Model
We trained the CNN model using the Keras deep learning library. Keras is an open-source neural network library which uses TensorFlow as its backend, and it allows us to easily experiment with and test neural network models just by tuning a few input values. We have built a total of three convolutional layers [4]. Each gesture in the dataset is an image of 240x320 pixels, and these images are converted into matrices whose entries represent the pixels of the image and range from 0 to 255.
 First Layer - The first layer has 32 filters for building the feature maps, each filter being of size 3x3. The input shape is set to 64x64 and the activation function used is ReLU (Rectified Linear Unit). The max pooling size is set to 2x2.
 Second Layer - This layer is similar to the first layer, with the same number and size of filters and the ReLU activation function.
 Third Layer - In this layer there are 64 filters of size 3x3 with the ReLU activation function and a max pooling size of 2x2. After this, a flatten layer is added, and the fully connected part contains two dense layers, the first with 256 units. The second dense layer uses the softmax function because there are more than two classes to predict.
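
A minimal Keras sketch of the architecture described above is given below. The number of output classes and the compile settings are illustrative assumptions, not the exact values from our training run.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

num_classes = 10  # hypothetical number of trained gestures

model = Sequential([
    # First layer: 32 filters of size 3x3 on a 64x64 RGB input, ReLU, 2x2 max pooling.
    Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    # Second layer: same configuration as the first.
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    # Third layer: 64 filters of size 3x3, ReLU, 2x2 max pooling.
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    # Flatten, then two dense layers: 256 units followed by a softmax output.
    Flatten(),
    Dense(256, activation="relu"),
    Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
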
 Recognition
There are many different ways of performing image recognition, but many of the top techniques involve convolutional neural networks, and a CNN can be configured specifically for image processing and recognition. Using a combination of techniques such as max pooling, stride configuration and padding, the convolutional filters work on images to help the machine learning program identify the subject of the picture. Once training is complete, the application is ready to recognize sign language input. The input is taken via a webcam as a continuous video stream; OpenCV treats each frame as an individual image, and the system returns a string output based on the previously trained gestures in the dataset.
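
A sketch of this per-frame recognition loop is given below, assuming the trained model has been saved as gesture_model.h5 and that the label list matches the training classes; both names are hypothetical, and the preprocessing must mirror what was used during training.

import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("gesture_model.h5")     # assumed file name of the trained model
labels = ["hello", "thanks", "yes", "no"]  # hypothetical gesture labels

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Preprocess each frame the same way as the training images (64x64, scaled to [0, 1]).
    img = cv2.resize(frame, (64, 64)).astype("float32") / 255.0
    probs = model.predict(np.expand_dims(img, axis=0), verbose=0)[0]
    text = labels[int(np.argmax(probs))]
    cv2.putText(frame, text, (10, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
    cv2.imshow("recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()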

 Speech
To enable a sign language speaker to talk to an audience, a speech synthesis mechanism enhances the effectiveness of the conversation. For this, we need a text-to-speech engine. There are many popular text-to-speech libraries available, such as Google's "gtts" (Google Text-to-Speech), pyttsx3, and multiple web speech APIs. For this project, we have used the pyttsx3 library, which helped us achieve real-time speech synthesis.
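
A minimal sketch of speaking a predicted label with pyttsx3 is shown below; pyttsx3 works offline, which is what makes it suitable for real-time use, and the speech rate chosen here is just an illustrative value.

import pyttsx3

engine = pyttsx3.init()            # uses the platform's offline speech driver
engine.setProperty("rate", 150)    # words per minute (illustrative value)

def speak(text: str) -> None:
    # Queue the predicted gesture text and block until it has been spoken.
    engine.say(text)
    engine.runAndWait()

speak("hello")                     # e.g. the string returned by the recognizer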

IV. RESULTS

The training accuracy we obtained was 99.93% and the loss was 0.339.

The following images are the outputs of the program:


Fig 4

To set the range of the HSV values, the user just has to slide the bars and check the effectiveness of the corresponding values. According to the given lower and upper limits of the HSV values, the background noise is reduced.
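
A sketch of such an HSV slider window built with OpenCV trackbars is given below; the window name, trackbar names and initial limits are illustrative assumptions.

import cv2
import numpy as np

def nothing(_):
    pass

# One trackbar per HSV limit; initial values are illustrative.
cv2.namedWindow("hsv_tuning")
for name, maximum, initial in [("H low", 179, 0), ("S low", 255, 40), ("V low", 255, 40),
                               ("H high", 179, 179), ("S high", 255, 255), ("V high", 255, 255)]:
    cv2.createTrackbar(name, "hsv_tuning", initial, maximum, nothing)

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    lower = np.array([cv2.getTrackbarPos(n, "hsv_tuning") for n in ("H low", "S low", "V low")], dtype=np.uint8)
    upper = np.array([cv2.getTrackbarPos(n, "hsv_tuning") for n in ("H high", "S high", "V high")], dtype=np.uint8)
    # Keep only pixels inside the chosen HSV range; everything else becomes background.
    mask = cv2.inRange(hsv, lower, upper)
    cv2.imshow("hsv_tuning", cv2.bitwise_and(frame, frame, mask=mask))
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()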

Fig 2

The above image shows the prediction of a gesture input which was set by the user. The input frame is converted to HSV form, and the HSV values can be set by the user according to his or her background conditions.
Fig 5

Fig 6

Fig 3

The graphs above show the model accuracy and the model loss obtained while training the CNN model.


V. CONCLUSION

The application of this concept can empower the sign language community and give it a voice: its members will be able to convey their thoughts to hearing people directly, and the lack of human interpreters will no longer be felt as strongly.

REFERENCES

[1]. Suharjito, Ricky Anderson, Fanny Wiryana, Meita Chandra Ariesta, Gede Putra Kusuma, "Sign Language Recognition Application Systems for Deaf-Mute People: A Review Based on Input-Process-Output," 2nd International Conference on Computer Science and Computational Intelligence 2017 (ICCSCI 2017), Bali, Indonesia.
[2]. Manisha U. Kakde, Mahender G. Nakrani, Amit M. Rawate, "A Review Paper on Sign Language Recognition for Deaf and Dumb People using Image Processing," IJERT, ISSN: 2278-0181, Vol. 5, Issue 03, March 2016.
[3]. Pratibha Pandey, Vinay Jain, "Hand Gesture Recognition for Sign Language: A Review," IJSETR, Vol. 4, Issue 3, March 2015.
[4]. Shreyasi Narayan Sawant, M. S. Kumbhar, "Real Time Sign Language Recognition using PCA," 2014 IEEE Conference on Advanced Communication Control and Computing Technologies.
[5]. Karishma Dixit, Anand Singh Jalal, "Automatic Indian Sign Language Recognition System," IEEE, 2012.
[6]. Jie Huang, Wengang Zhou, Houqiang Li, and Weiping Li, "Sign Language Recognition using Real-Sense."
[7]. Wen Gao, Gaolin Fang, Debin Zhao, and Yiqiang Chen, "A Chinese Sign Language Recognition System Based on SOFM/SRN/HMM," Pattern Recognition, vol. 37, no. 12, pp. 2389–2402, 2004.
