This project reads sign language from an action performed by a user. The action is mapped to a dictionary that contains the meaning (word) of the action. The model used is an LSTM neural network, chosen for the sequential nature of the data. Each action consists of a pre-defined number of frames (30), each containing landmarks that are detected by MediaPipe and stored as NumPy arrays.
Since the data is sequential (frames in order), a simple RNN was tried first, but it yielded poor results during inference: over long sequences its weights are not updated effectively, leading to the vanishing gradient problem and poor performance. This motivated the switch to an LSTM.
- Data Preparation: Using MediaPipe and OpenCV, individual actions are recorded and stored as frames. For sufficient training data, each action consists of 30 videos (sequences). Each sequence is split into 30 frames, with each frame represented as a NumPy array containing all landmarks (face, limbs, fingers, etc.). After capturing the actions, the arrays are labelled with their corresponding action names for prediction after training.
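The capture-and-label step above can be sketched as follows. The directory layout (`<action>/<sequence>/<frame>.npy`), the action names, and the parameter defaults are assumptions based on the description, not the project's exact code:

```python
import os
import numpy as np

def load_dataset(data_path, actions, num_sequences=30, sequence_length=30):
    """Load saved .npy frames into (X, y) arrays.

    Expects data_path/<action>/<sequence>/<frame>.npy, where each frame
    is a flat landmark array of equal length. Returns X with shape
    (num_actions * num_sequences, sequence_length, num_keypoints) and
    integer labels y.
    """
    label_map = {action: idx for idx, action in enumerate(actions)}
    sequences, labels = [], []
    for action in actions:
        for seq in range(num_sequences):
            # Collect one video's worth of frames into a window
            window = [
                np.load(os.path.join(data_path, action, str(seq), f"{frame}.npy"))
                for frame in range(sequence_length)
            ]
            sequences.append(window)
            labels.append(label_map[action])
    return np.array(sequences), np.array(labels)
```

The integer labels can then be one-hot encoded (e.g. with `tensorflow.keras.utils.to_categorical`) before training.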
- Training: The sequences are partitioned into training and test sets and fed into the model. The model comprises 3 LSTM layers and Dense layers, with a final `softmax` layer that outputs the action with the highest probability. The trained model is then exported in `.h5` format and loaded for inference.
- Inference: During inference, the probability distribution over the words is shown on screen to monitor the model's accuracy. Predicted words are also appended to a sentence to keep track of past predicted actions.
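A minimal sketch of the architecture described (3 LSTM layers, Dense layers, and a softmax head), assuming an input of 30 frames × 1662 MediaPipe Holistic keypoints per frame; the layer widths here are illustrative, not the project's exact values:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

NUM_ACTIONS = 3  # number of sign-language actions (assumption)

model = Sequential([
    # Stacked LSTMs: return_sequences=True passes the full sequence on,
    # the last LSTM collapses it into a single vector
    LSTM(64, return_sequences=True, activation="relu", input_shape=(30, 1662)),
    LSTM(128, return_sequences=True, activation="relu"),
    LSTM(64, return_sequences=False, activation="relu"),
    Dense(64, activation="relu"),
    Dense(32, activation="relu"),
    Dense(NUM_ACTIONS, activation="softmax"),  # one probability per action
])
model.compile(optimizer="Adam", loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])
# After training: model.save("action.h5"), then load_model("action.h5") for inference
```

At inference time, `model.predict` on a single 30-frame window yields the probability distribution that gets rendered on screen; `np.argmax` over it picks the word.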
- To view TensorBoard, downgrade protobuf from `3.20.3` to `3.20.1` via `pip install --upgrade protobuf==3.20.1`. TensorBoard can be brought up using `tensorboard --logdir=.`.
MediaPipe provides trained ML models for building pipelines that perform computer vision inference over arbitrary sensory data such as video or audio. Here it is used to detect facial and body features, which it outputs as landmarks that are easy to manipulate.
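The landmark output can be flattened into one fixed-length feature vector per frame. This sketch assumes the MediaPipe Holistic result attributes (`pose_landmarks`, `face_landmarks`, `left_hand_landmarks`, `right_hand_landmarks`) and zero-fills any part that is not detected, so every frame has the same length:

```python
import numpy as np

def extract_keypoints(results):
    """Flatten a MediaPipe Holistic result into a 1662-value vector.

    Pose: 33 landmarks x (x, y, z, visibility); face: 468 x (x, y, z);
    each hand: 21 x (x, y, z). Missing parts become zeros.
    """
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    face = (np.array([[lm.x, lm.y, lm.z]
                      for lm in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    lh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, face, lh, rh])
```

Zero-filling matters because hands often drop out of frame mid-gesture; without it, frames would have varying lengths and could not be stacked into a single array.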
In the future, I hope to extend this pipeline to be fine-tuned to a specific sign language. Given the many sign languages in the world, such as American Sign Language and Spanish Sign Language, fine-tuning is required to encompass their different gestures.
Each action is currently mapped to a single label, and each frame is independent of the others. In the real world, facial expressions and transitions between actions are also essential to building the context of a sentence, so a higher level of detail must be captured. More complex architectures, such as Transformers, are needed to capture this spatio-temporal structure and build up the correct context of a sentence.