Real-Time Sign Language Gesture (Word) Recognition from Video Sequences Using CNN and RNN

Abstract

There is a need for a method or an application that can recognize sign language gestures so that communication remains possible even when one party does not understand sign language. With this work, we intend to take a basic step toward bridging this communication gap using sign language recognition. Video sequences contain both temporal and spatial features. To train the model on spatial features, we used the Inception model, a deep convolutional neural network (CNN), and to train the model on temporal features we used a recurrent neural network (RNN). Our dataset consists of Argentinean Sign Language (LSA) gestures belonging to 46 gesture categories. The proposed model achieved a high accuracy of 95.2% over a large set of images.

Introduction

Sign language is a vision-based language that uses an amalgamation of visual cues such as hand shapes and gestures; the orientation, location, and movement of the hands and body; lip movement; and facial expressions. Like spoken languages, sign language has regional variants, e.g., Indian Sign Language (ISL), American Sign Language (ASL), and Portuguese Sign Language. Signing takes three forms: spelling each letter of a word with the fingers, using a sign vocabulary for words with hand and body movements, and conveying meaning through facial expressions and lip movement. Sign language can also be isolated or continuous: in isolated sign language, people communicate using gestures for single words, while continuous sign language is a sequence of gestures that forms a meaningful sentence. Methods for recognizing hand gestures can be broadly classified as vision-based or based on measurements made by sensors in gloves.

In this work, we attempt to recognize isolated sign language gestures with a vision-based method. Unlike other works, we chose a dataset with a larger number of gesture categories and a significant number of video samples, so that the resulting model has better generalization capabilities. We also explore how the benefits of RNNs can be exploited for gesture recognition.

Algorithms Used

Video classification is a challenging problem because a video sequence contains both temporal and spatial features. Spatial features are extracted from the individual frames of the video, whereas temporal features are extracted by relating the frames to one another over time. We therefore used two types of learning networks, one trained on each type of feature: a CNN for the spatial features and a recurrent neural network for the temporal features.
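As a concrete illustration of this split, the sketch below (with a hypothetical file path and frame limit) extracts the individual frames of a gesture video using OpenCV; the CNN later processes each frame independently, while the RNN consumes the ordered frame sequence.

    import cv2

    def extract_frames(video_path, max_frames=40):
        """Read a gesture video and return an ordered list of RGB frames."""
        capture = cv2.VideoCapture(video_path)
        frames = []
        while len(frames) < max_frames:
            ok, frame_bgr = capture.read()
            if not ok:  # end of video
                break
            # OpenCV decodes frames as BGR; convert to RGB for the CNN
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        capture.release()
        return frames

    # Usage (hypothetical path):
    # frames = extract_frames("lsa_videos/gesture_001_sample_01.mp4")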

Convolutional Neural Network

Convolutional neural networks (CNNs or ConvNets) are very good at capturing local spatial patterns in data: they find such patterns and then use them to classify images. ConvNets explicitly assume that the input to the network is an image. Due to the presence of pooling layers, CNNs are largely insensitive to small rotations or translations; i.e., an image and a slightly shifted or rotated version of it are classified as the same image. Because of these advantages of CNNs in extracting the spatial features of an image, we used the Inception-v3 [9] model from the TensorFlow [10] library, a deep ConvNet, to extract spatial features from the frames of the video sequences. Inception is a large image classification model with millions of parameters.
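The following is a minimal sketch, not the authors' exact code, of how Inception-v3 can serve as a frozen spatial feature extractor using the TensorFlow Keras API; the 299x299 input size and global-average-pooling output are standard for this model, while the batching of frames is an illustrative assumption.

    import numpy as np
    import tensorflow as tf

    # Inception-v3 without its classification head; global average pooling
    # yields one 2048-dimensional spatial feature vector per frame.
    cnn = tf.keras.applications.InceptionV3(
        weights="imagenet", include_top=False, pooling="avg")

    def frame_features(frames):
        """Map a list of RGB frames (H x W x 3, uint8) to CNN feature vectors."""
        batch = np.stack([tf.image.resize(f, (299, 299)).numpy() for f in frames])
        batch = tf.keras.applications.inception_v3.preprocess_input(batch)
        return cnn.predict(batch, verbose=0)  # shape: (num_frames, 2048)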

Recurrent Neural Network

There is information in a sequence itself, and recurrent neural networks (RNNs) exploit this for recognition tasks. Because RNNs contain loops, their output depends on the combination of the current input and the previous output. One drawback is that, in practice, plain RNNs are not able to learn long-term dependencies. Hence, our model uses Long Short-Term Memory (LSTM), a variation of the RNN built from LSTM units. LSTMs can learn to bridge time intervals in excess of 1,000 steps, even for noisy, incompressible input sequences. The first layer feeds the input to the subsequent layers, and its size is determined by the size of the input. Our model is a wide network consisting of a single layer of 256 LSTM units. This layer is followed by a fully connected layer with softmax activation. Finally, a regression layer is applied to fit the network to the provided targets.
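Below is a minimal sketch of the recurrent part of the network as described above, written with the Keras API rather than the authors' original tooling; the sequence length and per-frame feature size (40 frames of 2048-dimensional CNN features) are assumptions for illustration, and the "regression" step is expressed here as fitting with categorical cross-entropy.

    import tensorflow as tf

    NUM_CLASSES = 46               # gesture categories in the LSA dataset
    SEQ_LEN, FEAT_DIM = 40, 2048   # assumed frames per video and CNN feature size

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(SEQ_LEN, FEAT_DIM)),  # input layer sized by the input
        tf.keras.layers.LSTM(256),                          # single wide layer of 256 LSTM units
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # fully connected + softmax
    ])

    # Fitting the network to the provided targets, e.g. with cross-entropy:
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])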


Methodology

Two approaches were used to train the model on the temporal and the spatial features; they differ in the way inputs are given to the RNN for training on the temporal features (a sketch contrasting the two is given after the list below).

  • Prediction Approach
  • Pool Layer Approach
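The two approaches are only named here, not detailed; the sketch below shows one plausible reading, in which the RNN's per-frame input is either the CNN's class-prediction vector (Prediction Approach) or the activations of its final pooling layer (Pool Layer Approach). The layer names follow the Keras InceptionV3 implementation and the overall reading is an assumption, not a confirmed description of the authors' pipeline.

    import tensorflow as tf

    base = tf.keras.applications.InceptionV3(weights="imagenet")

    # Prediction Approach (assumed): the per-frame softmax prediction vector
    # of the CNN forms the input sequence for the RNN.
    prediction_extractor = base  # full model, softmax output per frame

    # Pool Layer Approach (assumed): the 2048-dimensional activations of the
    # global-average-pooling layer form the RNN input instead.
    pool_extractor = tf.keras.Model(
        inputs=base.input,
        outputs=base.get_layer("avg_pool").output)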


Conclusion

In this work, we presented a vision-based system to interpret isolated hand gestures of Argentinean Sign Language. The work used two different networks to classify based on the spatial and temporal features: a CNN for the spatial features and an RNN for the temporal features. We obtained an accuracy of 95.217%. This shows that a CNN combined with an RNN can successfully learn spatial and temporal features and classify sign language gesture videos.
