2019
DOI: 10.1109/tcsvt.2018.2870740

Attention-Based 3D-CNNs for Large-Vocabulary Sign Language Recognition

Cited by 133 publications (115 citation statements). References 35 publications.
“…If action recognition is performed on raw video data, authors prefer to use convolutional neural networks, where convolutional layers generate features. Those features are then processed by a fully connected neural network that performs classification [16]. Sometimes the raw input signal is processed by a convolutional layer followed by a recurrent network, to avoid a sliding-window design, and then classified by a fully connected neural network [17].…”
Section: Effective Methods of Human Motion Analysis and Classification
confidence: 99%
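
The pipeline this excerpt describes is straightforward to sketch. Below is a minimal, hypothetical PyTorch version: a small per-frame convolutional feature extractor, an LSTM in place of a sliding-window design, and a fully connected classifier. The name ConvLSTMClassifier and all layer sizes are illustrative, not taken from [16] or [17].

import torch
import torch.nn as nn

class ConvLSTMClassifier(nn.Module):
    def __init__(self, num_classes: int = 10, hidden_dim: int = 64):
        super().__init__()
        # Convolutional feature extractor applied to each frame independently.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (B*T, 32, 1, 1)
        )
        # Recurrent network consumes the per-frame features.
        self.lstm = nn.LSTM(32, hidden_dim, batch_first=True)
        # Fully connected classifier on the final hidden state.
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        feats = self.conv(video.view(b * t, c, h, w)).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)
        return self.fc(h_n[-1])

# Usage: 2 clips of 16 RGB frames at 64x64 resolution.
logits = ConvLSTMClassifier()(torch.randn(2, 16, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 10])
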
“…These models learn the relevant spatial or temporal parts of the image or video automatically from the data. Such models have also been used in the SLR domain [2], [8], [34], [36], [40].…”
Section: Related Work
confidence: 99%
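
As a rough illustration of how a model can learn the "relevant temporal parts" automatically, the following sketch applies soft attention weights over per-frame features. The module name TemporalAttention and its dimensions are invented for illustration and do not come from the cited works.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # one relevance score per frame

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim)
        weights = torch.softmax(self.score(feats), dim=1)  # (batch, time, 1)
        return (weights * feats).sum(dim=1)  # attention-pooled clip feature

# Usage: pool 16 frame features of dimension 64 into one clip feature.
pooled = TemporalAttention()(torch.randn(2, 16, 64))
print(pooled.shape)  # torch.Size([2, 64])
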
“…Then an LSTM is used to model the temporal characteristics of the stream. In recent years, some studies have used 3D-CNNs to capture spatio-temporal features jointly [2], [3], [37]. In [3], pose-based and visual-appearance-based approaches are compared.…”
Section: Sign Language Datasets
confidence: 99%
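
For contrast with the CNN+LSTM design above, a 3D-CNN convolves over time as well as space, so spatio-temporal features are captured in a single pass rather than by a separate recurrent stage. The minimal sketch below, under the illustrative name Tiny3DCNN, uses small layer sizes chosen only for clarity.

import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            # 3D kernels span (time, height, width) simultaneously.
            nn.Conv3d(3, 16, kernel_size=(3, 3, 3), padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),   # pool space only, keep temporal extent
            nn.Conv3d(16, 32, kernel_size=(3, 3, 3), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # -> (B, 32, 1, 1, 1)
        )
        self.fc = nn.Linear(32, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, time, height, width)
        return self.fc(self.features(clip).flatten(1))

# Usage: 2 clips of 16 RGB frames at 64x64 resolution.
logits = Tiny3DCNN()(torch.randn(2, 3, 16, 64, 64))
print(logits.shape)  # torch.Size([2, 10])
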
“…LS-HAN contains three components: a two-stream Convolutional Neural Network (CNN) for video feature representation, a Latent Space (LS) to bridge the semantic gap, and a Hierarchical Attention Network (HAN) for recognition. Huang et al. [34] presented attention-based 3D convolutional neural networks (3D-CNNs). This model learns spatial and temporal features from raw video, and the attention mechanism helps it focus on the areas of interest.…”
Section: Related Work
confidence: 99%
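
To illustrate the general mechanism of attention over 3D-CNN features, the sketch below re-weights spatio-temporal feature maps with a learned spatial attention map, so high-weight regions (e.g., around the signer's hands) dominate the pooled representation. This is a plausible minimal mechanism only, not the exact architecture of Huang et al. [34]; the name SpatialAttention3D and all shapes are assumptions.

import torch
import torch.nn as nn

class SpatialAttention3D(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        # 1x1x1 conv yields one attention logit per spatio-temporal location.
        self.attn = nn.Conv3d(channels, 1, kernel_size=1)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (batch, channels, time, height, width)
        b, c, t, h, w = fmap.shape
        logits = self.attn(fmap).view(b, 1, t, h * w)
        # Normalize over spatial locations within each frame.
        weights = torch.softmax(logits, dim=-1).view(b, 1, t, h, w)
        return fmap * weights  # re-weighted feature maps

# Usage: re-weight a hypothetical 3D-CNN feature volume.
out = SpatialAttention3D()(torch.randn(2, 32, 8, 14, 14))
print(out.shape)  # torch.Size([2, 32, 8, 14, 14])
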