2019
DOI: 10.1109/tcsvt.2018.2870740

Attention-Based 3D-CNNs for Large-Vocabulary Sign Language Recognition

Cited by 133 publications (115 citation statements). References 35 publications.
“…If action recognition is performed on raw video data, authors prefer to use convolutional neural networks, where convolutional layers generate features. Those features are then processed by a fully connected neural network that performs classification [16]. Sometimes the raw input signal is processed by a convolutional layer followed by a recurrent network, to avoid a sliding-window design, and then classified by a fully connected neural network [17].…”
Section: Effective Methods of Human Motion Analysis and Classification
confidence: 99%
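
The pipeline this excerpt describes is straightforward to sketch. Below is a minimal, hypothetical PyTorch version: a small per-frame convolutional feature extractor, an LSTM in place of a sliding-window design, and a fully connected classifier. The name ConvLSTMClassifier and all layer sizes are illustrative, not taken from [16] or [17].

import torch
import torch.nn as nn

class ConvLSTMClassifier(nn.Module):
    def __init__(self, num_classes: int = 10, hidden_dim: int = 64):
        super().__init__()
        # Convolutional feature extractor applied to each frame independently.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (B*T, 32, 1, 1)
        )
        # Recurrent network consumes the per-frame features.
        self.lstm = nn.LSTM(32, hidden_dim, batch_first=True)
        # Fully connected classifier on the final hidden state.
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        feats = self.conv(video.view(b * t, c, h, w)).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)
        return self.fc(h_n[-1])

# Usage: 2 clips of 16 RGB frames at 64x64 resolution.
logits = ConvLSTMClassifier()(torch.randn(2, 16, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 10])
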
“…These models learn the relevant spatial or temporal parts of the image or video automatically from the data. Such models have also been used in the SLR domain [2], [8], [34], [36], [40].…”
Section: Related Work
confidence: 99%
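
As a rough illustration of how a model can learn the "relevant temporal parts" automatically, the following sketch applies soft attention weights over per-frame features. The module name TemporalAttention and its dimensions are invented for illustration and do not come from the cited works.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # one relevance score per frame

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim)
        weights = torch.softmax(self.score(feats), dim=1)  # (batch, time, 1)
        return (weights * feats).sum(dim=1)  # attention-pooled clip feature

# Usage: pool 16 frame features of dimension 64 into one clip feature.
pooled = TemporalAttention()(torch.randn(2, 16, 64))
print(pooled.shape)  # torch.Size([2, 64])
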
“…Then an LSTM is used to model the temporal characteristics of the stream. In recent years, some studies have used 3D-CNNs to capture spatio-temporal features jointly [2], [3], [37]. In [3], pose-based and visual-appearance-based approaches are compared.…”
Section: Sign Language Datasets
confidence: 99%
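
For contrast with the CNN+LSTM design above, a 3D-CNN convolves over time as well as space, so spatio-temporal features are captured in a single pass rather than by a separate recurrent stage. The minimal sketch below, under the illustrative name Tiny3DCNN, uses small layer sizes chosen only for clarity.

import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            # 3D kernels span (time, height, width) simultaneously.
            nn.Conv3d(3, 16, kernel_size=(3, 3, 3), padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),   # pool space only, keep temporal extent
            nn.Conv3d(16, 32, kernel_size=(3, 3, 3), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # -> (B, 32, 1, 1, 1)
        )
        self.fc = nn.Linear(32, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, time, height, width)
        return self.fc(self.features(clip).flatten(1))

# Usage: 2 clips of 16 RGB frames at 64x64 resolution.
logits = Tiny3DCNN()(torch.randn(2, 3, 16, 64, 64))
print(logits.shape)  # torch.Size([2, 10])
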
“…LS-HAN contains three components: a two-stream Convolutional Neural Network (CNN) for video feature representation, a Latent Space (LS) to bridge the semantic gap, and a Hierarchical Attention Network (HAN) for recognition. Huang et al. [34] presented attention-based 3D convolutional neural networks (3D-CNNs). This model learns spatial and temporal features from raw video, and the attention mechanism helps it focus on the areas of interest.…”
Section: Related Work
confidence: 99%
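
To illustrate the general mechanism of attention over 3D-CNN features, the sketch below re-weights spatio-temporal feature maps with a learned spatial attention map, so high-weight regions (e.g., around the signer's hands) dominate the pooled representation. This is a plausible minimal mechanism only, not the exact architecture of Huang et al. [34]; the name SpatialAttention3D and all shapes are assumptions.

import torch
import torch.nn as nn

class SpatialAttention3D(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        # 1x1x1 conv yields one attention logit per spatio-temporal location.
        self.attn = nn.Conv3d(channels, 1, kernel_size=1)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (batch, channels, time, height, width)
        b, c, t, h, w = fmap.shape
        logits = self.attn(fmap).view(b, 1, t, h * w)
        # Normalize over spatial locations within each frame.
        weights = torch.softmax(logits, dim=-1).view(b, 1, t, h, w)
        return fmap * weights  # re-weighted feature maps

# Usage: re-weight a hypothetical 3D-CNN feature volume.
out = SpatialAttention3D()(torch.randn(2, 32, 8, 14, 14))
print(out.shape)  # torch.Size([2, 32, 8, 14, 14])
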