2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
DOI: 10.1109/cvprw53098.2021.00383
Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention

Abstract: Automatic sign language recognition lies at the intersection of natural language processing (NLP) and computer vision. The highly successful transformer architectures, based on multi-head attention, originate from the field of NLP. The Video Transformer Network (VTN) is an adaptation of this concept for tasks that require video understanding, e.g., action recognition. However, due to the limited amount of labeled data that is commonly available for training automatic sign (language) recognition, the VTN cannot…
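The abstract's core mechanism, multi-head attention over a sequence of per-frame embeddings, can be illustrated with a minimal single-head sketch. This shows scaled dot-product self-attention in general, not the paper's implementation; the frame count, embedding size, and identity projection matrices below are assumptions for the toy example.

```python
import numpy as np

def self_attention(frames, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of frame embeddings.

    frames: (T, d) array, one row per video frame embedding.
    Wq, Wk, Wv: (d, d) projection matrices for queries, keys, values.
    """
    Q, K, V = frames @ Wq, frames @ Wk, frames @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (T, T) pairwise frame affinities
    # Numerically stable softmax over the frame axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # (T, d): each frame re-expressed as a mix of all frames

# Toy example: 4 frames with 8-dim embeddings, identity projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
I = np.eye(8)
out = self_attention(x, I, I, I)
print(out.shape)  # (4, 8)
```

A multi-head version would run several such heads with separate projections and concatenate their outputs; the VTN applies this over features extracted from video frames.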

Cited by 33 publications (13 citation statements)
References 22 publications
“…The author used a custom dataset of over 24,624 images for the experiment. Mathieu De Coster et al. [8] proposed a sign language recognition methodology over the Flemish Sign Language Corpus. The authors used OpenPose feature extraction and end-to-end learning with a CNN, and applied a multi-head attention approach to isolated sign recognition.…”
Section: Related Work
confidence: 99%
“…Over a class of 100 signs, a state-of-the-art accuracy of 74.7% has been obtained on the Flemish Sign Language Corpus. The authors introduce the Multimodal Transformer Network with Pose LSTM and Pose Transformer variants, applying self-attention for sign language recognition [8]. Mannan A. et al. [9] proposed a hyper-tuned deep CNN for static American Sign Language; the authors used data augmentation to create more training samples, as deep learning model accuracy increases with more data available for training.…”
Section: Related Work
confidence: 99%
“…Other recent work employed a Video Transformer Network (VTN) for sign language recognition [34]. The VTN is a modified version of the transformer that was originally developed for machine translation [35].…”
Section: Related Work
confidence: 99%
“…The proposed architecture was compared with the state-of-the-art graph-based architecture on both AUTSL and ASLLVD datasets. In Table 3, the performance of the proposed architecture is compared with the reported results for different variants of the VTN architecture on the AUTSL dataset [34]. From Table 3, we can observe that the proposed architecture with spatial attention enhancement outperformed the best variant of VTN (VTN-PF) on both the validation and test datasets.…”
confidence: 99%
“…They achieved a top-1 accuracy of 69.9% on the NMFs-CSL dataset and 96.8% on the isolated SLR 500 dataset. De Coster et al. [141] proposed pose flow and hand cropping in combination with a Video Transformer Network for isolated sign language recognition. The VTN-PF (Video Transformer Network with hand cropping and pose flow) model achieved an accuracy of 92.92% on the AUTSL dataset.…”
Section: B Study Of Current State-of-the-art Models For Sign Language...
confidence: 99%
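The pose-flow input named above (the "PF" in VTN-PF) is derived from pose keypoints. A common way to obtain a flow-like signal from keypoints is to take frame-to-frame displacements, sketched below under the assumption of OpenPose-style (x, y) keypoint arrays; the paper's exact pose-flow definition may differ, and the function name `pose_flow` and the array shapes are hypothetical.

```python
import numpy as np

def pose_flow(keypoints):
    """Frame-to-frame displacement of pose keypoints (a flow-like signal).

    keypoints: (T, K, 2) array of (x, y) positions for K joints over T frames.
    Returns a (T-1, K, 2) array of displacement vectors between consecutive frames.
    """
    return np.diff(keypoints, axis=0)

# Toy example: 3 frames, 5 joints; all joints shift by (1, 1) after frame 0.
kp = np.zeros((3, 5, 2))
kp[1:] += 1.0
flow = pose_flow(kp)
print(flow.shape)  # (2, 5, 2)
```

Such displacement features complement raw keypoint positions by encoding motion explicitly, which is why pose-based sign recognition models often feed both to the network.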