2019
DOI: 10.1109/access.2019.2959206

Multimodal Spatiotemporal Networks for Sign Language Recognition

Abstract: Unlike other human behaviors, sign language is characterized by limited local motion of the upper limbs and meticulous hand actions. Some sign language gestures are ambiguous in RGB video due to the influence of lighting and background color, which affects recognition accuracy. We propose a multimodal deep learning architecture for sign language recognition that effectively combines RGB-D input and two-stream spatiotemporal networks. Depth videos, as an effective compensation for RGB input, can su…

Cited by 25 publications (14 citation statements)
References 47 publications (45 reference statements)
“…The aforementioned previous study achieved 72.73% accuracy on the leap motion data alone, which is the same dataset used in the experiments in this article. Zhang et al.'s 2019 study [25] found that multi-modality could drastically improve sign recognition when fusing RGB and depth data. The model presented in that study was computationally expensive, requiring two VGG16 convolutional neural networks to process the sensor information.…”
Section: Background and Related Work
confidence: 99%
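For readers unfamiliar with the fusion design mentioned in that statement, here is a minimal sketch of a two-stream RGB-D network in the spirit of the cited approach: one VGG16 backbone per modality with concatenated features. The fusion point, layer sizes, and class count are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical two-stream RGB-D fusion sketch (not the cited paper's
# exact architecture): one VGG16 feature extractor per modality,
# feature maps concatenated before a classifier head.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class TwoStreamRGBD(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # Separate VGG16 backbones for RGB frames and depth frames.
        # Depth maps are assumed replicated to 3 channels to match
        # VGG16's expected input.
        self.rgb_stream = vgg16(weights=None).features
        self.depth_stream = vgg16(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        # Classifier over the concatenated 512+512-channel features.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * 512 * 7 * 7, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        f_rgb = self.pool(self.rgb_stream(rgb))        # (N, 512, 7, 7)
        f_depth = self.pool(self.depth_stream(depth))  # (N, 512, 7, 7)
        return self.classifier(torch.cat([f_rgb, f_depth], dim=1))
```

Running two full VGG16 backbones per frame is what makes this design computationally expensive, as the citing study points out.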
“…These features were finally concatenated and fed to an encoder-decoder LSTM network that predicted the sub-words that form the signed word. Zhang et al. in [91] proposed a highly accurate SLR method that first selected pairs of aligned RGB-D images to reduce redundancy. The method then computed discriminative features from hand regions using a spatial stream and extracted depth motion features using a temporal stream.…”
Section: Sign Language Recognition
confidence: 99%
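As a rough illustration of the spatial/temporal two-stream idea described in that statement, the following sketch concatenates per-frame spatial features with motion features and classifies the sequence with an LSTM. The feature dimensions and the single-layer LSTM are assumptions for illustration, not the cited architecture.

```python
# Illustrative sketch, under assumed dimensions: fuse per-frame
# spatial (hand-region) features with temporal (depth-motion)
# features, then classify the sequence from the LSTM's final state.
import torch
import torch.nn as nn

class SpatioTemporalFusion(nn.Module):
    def __init__(self, spatial_dim: int = 512, temporal_dim: int = 512,
                 hidden: int = 256, num_classes: int = 100):
        super().__init__()
        self.lstm = nn.LSTM(spatial_dim + temporal_dim, hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, spatial_feats: torch.Tensor,
                temporal_feats: torch.Tensor) -> torch.Tensor:
        # Both inputs: (N, T, dim) sequences of per-frame features.
        fused = torch.cat([spatial_feats, temporal_feats], dim=-1)
        _, (h_n, _) = self.lstm(fused)
        return self.head(h_n[-1])  # classify from the last hidden state
```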
“…Zhang et al. [25] used RGB and depth images together in their study and gained a 6% improvement compared to using RGB alone. They also reported that depth images were more robust to changes in lighting and environment and captured the signs better.…”
Section: Related Work
confidence: 99%
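A common way to realize the kind of RGB-plus-depth combination reported in that statement is score-level (late) fusion. The sketch below averages per-class probabilities from two single-modality models; the weighting scheme is an assumption, not the method of the cited study, which reports the ~6% gain but not this exact scheme.

```python
# Hedged illustration of late (score-level) fusion of an RGB-only
# model and a depth-only model; the 50/50 default weight is an
# assumption, not taken from the cited paper.
import torch

def fuse_scores(rgb_logits: torch.Tensor,
                depth_logits: torch.Tensor,
                w_rgb: float = 0.5) -> torch.Tensor:
    """Weighted average of per-class probabilities from two modalities."""
    p_rgb = torch.softmax(rgb_logits, dim=-1)
    p_depth = torch.softmax(depth_logits, dim=-1)
    return w_rgb * p_rgb + (1.0 - w_rgb) * p_depth

# Predicted class: fuse_scores(rgb_logits, depth_logits).argmax(dim=-1)
```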