Sign Language Recognition System With Speech Output
Abstract - The Indian deaf and mute community is troubled by a lack of capable sign language volunteers, and this lack is reflected in the relative isolation of the community. Many techniques have been used to apply machine learning and computer vision principles to recognize sign language inputs in real time, and CNNs are identified as among the more accurate of these techniques. This project aims to use a CNN to classify and recognize sign language inputs and produce text-to-speech outputs. In doing so, sign language speakers will be able to communicate with larger audiences without depending on the speed of a human interpreter. OpenCV is used for live detection of the hand gestures performed by the user. We are using Indian Sign Language to train the model. After recognition of a particular gesture, the system converts the predicted text to speech so that the other person can hear the message the user wants to convey.

…or any other facilities which can support a conversation between these two individuals. Greater challenges lie when a sign language speaker has to interact with a large audience. Sensor-based methods provide feasible solutions, but wearing extra equipment on the hands is inconvenient to people [7].

Roughly 28 million people in India suffer from some level of hearing loss, according to research by the Centers for Disease Control and Prevention. A small percentage of these people may not use sign language at all, instead using methods like lip reading. A person who communicates using lip reading can participate more actively in real-world conversations. On the other hand, those who only use sign language are limited to carrying out conversations with fellow sign language speakers. Therefore, they require interpreters to engage fully in real-world conversations.
…demonstrate that by using simple hand gestures the corresponding letters can be predicted and obtained as output.

III. PROPOSED WORK

The input will be a webcam live feed of a human speaking in sign language. The waist-up portion will be within the frame to maximize the variety of signs taken as input. The challenge we face is the application of a suitable computer vision method that can detect the smallest discrepancy between similar signs, to avoid inaccurate translation in real time. We also need to train a separate dataset of random gestures and movements to ensure they are not translated to convey meaning. Next, we need to instantly display captions, produced by the translation of the sign language inputs, on the screen to give extra assistance to the audience. Since our final goal is to give a sound output for the aforementioned translation of sign language, we need to employ a suitable text-to-speech software or web API that can convert this input into sound output in the most human way possible, to avoid a robot-like sound output.
A. Sign Language
Humans are social animals who have communication at the core of their characteristics. Languages have helped us evolve as a species. One such language is sign language, which is essentially a means of communication used to convey meaning through gestures and symbols. Across the world, there are different versions of sign language depending on the region [5]. ISL (Indian Sign Language) consists of both isolated signs and continuous signs. An isolated sign is a single hand gesture, and a continuous sign is a moving gesture represented as a series of images. Sign language helps deaf and mute people gain knowledge and gives them a real chance at having a formal education. In India, the lack of knowledge about sign language has caused speech- and hearing-impaired people to drop out of the education system rapidly. This is something we aim to change. The fight to make Indian Sign Language official has been encouraging lately, as 'The First Indian Sign Language Dictionary of 3000 Words' was launched in March 2018. We realised that some level of automation is needed in this role to help the deaf and mute gain access to education and further improve their lives.
B. Computer Vision
Computer Vision is at the crux of our work. It is a technology that allows computers to perceive the outside world through the lens of a camera and make logical interpretations. Computer Vision has evolved rapidly since its applications and potential were noticed shortly after its introduction. Computer-vision-aided machine learning is being used to solve a massive number of real-world problems such as automated driving, surveillance, object detection, facial recognition, gesture detection, and so on. Computer vision in health care has the potential to deliver real value: while computers won't replace healthcare personnel, there is a possibility of improving routine diagnostics that require a lot of time and expertise. In this way, computer vision serves as a helping tool for healthcare. Our work is based on acquiring visual input and processing it further. There are a number of libraries available that facilitate the use of computer vision, such as OpenCV, PCL (Point Cloud Library), ROS for robotic vision, and MATLAB. We have used OpenCV for its seamless integration into Python programs. Training and classification in OpenCV is done with the help of neural networks, which have been shown to have an accuracy of as much as 94.2% in [1]. Neural networks used for computer vision applications require a lot of high-quality data. The algorithms need a large amount of data related to the project in order to produce good results. Images are available online in large quantities, but the solution to many real-world problems needs quality labelled training data, which makes it expensive because the labelling has to be done manually.
C. Convolutional Neural Networks (CNN)
A Convolutional Neural Network is a type of artificial neural network (ANN) [2] which makes use of convolutional layers for filtering inputs and extracting information. The convolution operation combines the input data, also known as a feature map, with a convolution kernel (filter) to produce a transformed feature map. The filters in convolutional layers are adjusted according to learned parameters to extract the most relevant information for a given task. Convolutional neural networks make automatic adjustments to find the most suitable features required by a task: the network will filter for information related to the shape of an object when assigned an object recognition task, but will extract the colour of an animal when assigned an animal recognition task. This is based on the ability of the Convolutional Neural Network to understand that different classes of objects will have different shapes, but that different types of animals are more likely to differ in colour than in shape. Convolutional Neural Networks have various applications, including natural language processing, sentiment analysis, image classification, image recognition, video analysis, text analysis, speech recognition, and text classification, as well as Artificial Intelligence systems such as virtual assistants, robots, autonomous cars, drones, and manufacturing machines. The CNN has been important in our implementation, being useful for training our dataset of multiple gestures. A CNN consists of three kinds of layers, namely an input layer, an output layer, and hidden layers, which comprise the convolutional layers, pooling layers, fully connected layers, and normalization layers.
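As an illustration of the convolution operation described above, the following is a minimal sketch in Python/NumPy (not part of our system; the image and kernel values are arbitrary). It slides a kernel over an image and sums the element-wise products at each position, producing the transformed feature map:

    import numpy as np

    def convolve2d(image, kernel):
        """Slide `kernel` over `image`, summing element-wise
        products at each position (no padding, stride 1)."""
        kh, kw = kernel.shape
        out_h = image.shape[0] - kh + 1
        out_w = image.shape[1] - kw + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    # Example: a vertical-edge kernel applied to a 5x5 image.
    image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
    kernel = np.array([[1, 0, -1],
                       [1, 0, -1],
                       [1, 0, -1]], dtype=float)
    print(convolve2d(image, kernel))  # large-magnitude response at the edge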
D. Technical Details
The need to derive a solution to the lack of sign language volunteers encouraged us to make a system that can take signs as input, analyse them against the dataset, generate a string output for a specific sign, and use that string to generate a speech output. We have also included functionality that enables users to train their own gestures.

We used Python for the programming, along with OpenCV and the pyttsx3 library for real-time speech generation.
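At a high level, these pieces fit together as in the following simplified sketch; `preprocess` and `predict_gesture` are hypothetical stand-ins for the steps detailed in the sections that follow, not our literal code:

    import cv2
    import pyttsx3

    engine = pyttsx3.init()            # offline text-to-speech engine
    cap = cv2.VideoCapture(0)          # webcam live feed

    last_spoken = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        roi = predict_input = preprocess(frame)   # hypothetical: isolate the hand region
        word = predict_gesture(roi)               # hypothetical: CNN returns a string
        if word != last_spoken:                   # speak only when the sign changes
            engine.say(word)
            engine.runAndWait()
            last_spoken = word
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()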
Fig 1
Data Acquisition
The form of data that needs to be processed, and hence acquired, is image data, which can easily be captured by a webcam, as in [2], with the help of OpenCV functions that allow live video to be broken down into single frames representing individual images. [3] have used a 3D model approach. We have captured our own dataset for each gesture, so that users can set their own gestures for different words and do not have to remember the standard gestures. To enhance the specificity of the acquired data, we can use skeletonization and background reduction techniques. Skeletonization reduces the main subject of the image to a skeleton-like frame joined at points. Background reduction is used to remove redundant data that interferes with the subject and causes inefficiency. These images are converted to numerical form using the numpy library, which facilitates mathematical functions, finally converting each frame into a matrix containing values for each pixel in the image.
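A minimal sketch of this acquisition step, assuming OpenCV and NumPy are installed (the HSV skin-colour bounds here are illustrative placeholders, not the tuned values from our experiments):

    import cv2
    import numpy as np

    cap = cv2.VideoCapture(0)                      # open the webcam

    # Illustrative HSV bounds for segmenting the hand; the actual
    # values are tuned with the trackbars described in the Results section.
    lower = np.array([0, 40, 60], dtype=np.uint8)
    upper = np.array([20, 255, 255], dtype=np.uint8)

    ok, frame = cap.read()                         # one frame of the live video
    if ok:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, lower, upper)      # background reduction
        roi = cv2.resize(mask, (64, 64))           # match the CNN input shape
        matrix = np.asarray(roi)                   # pixel values in [0, 255]
        print(matrix.shape, matrix.min(), matrix.max())
    cap.release()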
Training CNN Model
We trained the CNN model using the Keras deep learning library. Keras is an open-source neural network library which uses TensorFlow in the background for processing, and it allows us to easily experiment with and test neural network models just by tuning some input values. We have built a total of three convolutional layers [4]. Each gesture in the dataset is an image of 240x320 pixels; these images are converted into matrix form, where the values of the matrix represent the pixels of the image and range from 0 to 255.

First Layer - The first layer has 32 filters of 3x3 size for building the feature maps. The input shape is set to 64x64 and the activation function used is ReLU (Rectified Linear Unit). The max pooling size is set to 2x2.
Second Layer - This layer is similar to the first layer, with the same filters, feature maps, and ReLU activation function.
Third Layer - In this layer there are 64 filters of 3x3 size, with ReLU activation and a max pooling size of 2x2. After this, a flatten layer is added, and the fully connected part contains two dense layers: the first has 256 units, and the second uses the softmax function because there are more than two classes of outputs to predict.
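A sketch of this architecture in Keras follows. It mirrors the three-layer description above; `num_classes` and the single-channel (grayscale) input are assumptions, since the exact class count and colour depth are not fixed here:

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    num_classes = 10  # assumed number of gesture classes

    model = Sequential()
    # First layer: 32 filters of 3x3 on a 64x64 input, ReLU, 2x2 max pooling.
    model.add(Conv2D(32, (3, 3), input_shape=(64, 64, 1), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # Second layer: same configuration as the first.
    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # Third layer: 64 filters of 3x3, ReLU, 2x2 max pooling.
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # Flatten, then two dense layers; softmax for multi-class output.
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dense(num_classes, activation='softmax'))

    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()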
Recognition
There are many different ways of performing image recognition, but many of the top techniques involve the use of convolutional neural networks, which can be specifically set up for image processing and recognition. Using a combination of techniques such as max pooling, stride configuration, and padding, convolutional filters work on images to help machine learning programs identify the subject of the picture. Once the training is complete, the application will be ready for recognition of sign language input. The input is taken via a webcam as a continuous video stream, each frame of which OpenCV treats as an individual image, and the system returns a string output based on the previously trained gestures in the dataset.
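A sketch of this per-frame recognition step, assuming the trained `model` from the previous section and the same preprocessing as in Data Acquisition (`labels` is a hypothetical list mapping class indices to gesture strings):

    import cv2
    import numpy as np

    labels = ['hello', 'thanks', 'yes']            # hypothetical gesture strings

    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        roi = cv2.resize(gray, (64, 64)).astype('float32') / 255.0
        batch = roi.reshape(1, 64, 64, 1)          # one-image batch for Keras
        probs = model.predict(batch)[0]            # class probabilities
        text = labels[int(np.argmax(probs))]       # string output for this frame
        cv2.putText(frame, text, (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.imshow('Recognition', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()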
Speech
To enable a sign language speaker to talk to an
audience, a speech synthesis mechanism would enhance the
effectiveness of the conversation. For this, we need a text-
to-speech engine. There are many popular text-to-speech libraries available, such as Google's "gtts" (Google Text-to-Speech), pyttsx3, and multiple web speech APIs. For this project, we have used the pyttsx3 library, which helped us
achieve real-time speech synthesis.
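A minimal usage sketch of pyttsx3 as we use it (the spoken string is a placeholder for the recognizer's output):

    import pyttsx3

    engine = pyttsx3.init()                 # start the offline TTS engine
    engine.setProperty('rate', 150)         # speaking speed, in words per minute
    predicted_text = 'hello'                # placeholder for the CNN's output
    engine.say(predicted_text)              # queue the utterance
    engine.runAndWait()                     # block until speech finishes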
IV. RESULTS
To set the range of the HSV values, the user just has to slide the bars and check the effectiveness of the corresponding values. According to the given lower and upper limits of the HSV values, the background noise will be reduced.
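A sketch of such a calibration window using OpenCV trackbars (the six bars set the lower and upper HSV limits; the window and bar names are our own):

    import cv2
    import numpy as np

    def nothing(_):                  # trackbars require a callback
        pass

    cv2.namedWindow('calibrate')
    for name, maxval in [('H low', 179), ('S low', 255), ('V low', 255),
                         ('H high', 179), ('S high', 255), ('V high', 255)]:
        cv2.createTrackbar(name, 'calibrate',
                           0 if 'low' in name else maxval, maxval, nothing)

    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        lower = np.array([cv2.getTrackbarPos(n, 'calibrate')
                          for n in ('H low', 'S low', 'V low')], dtype=np.uint8)
        upper = np.array([cv2.getTrackbarPos(n, 'calibrate')
                          for n in ('H high', 'S high', 'V high')], dtype=np.uint8)
        mask = cv2.inRange(hsv, lower, upper)   # background noise removed
        cv2.imshow('calibrate', mask)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()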
Fig 2
Fig 6
V. CONCLUSION
REFERENCES