2020
DOI: 10.1109/access.2020.2993650
Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space

Abstract: Continuous Sign Language Recognition (CSLR) refers to the challenging problem of recognizing sign language glosses and their temporal boundaries from weakly annotated video sequences. Previous methods focus mostly on visual feature extraction, neglecting text information and failing to effectively model intra-gloss dependencies. In this work, a cross-modal learning approach that leverages text information to improve vision-based CSLR is proposed. To this end, two powerful encoding networks are initially use…
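The abstract's central idea, aligning video and text embeddings in a shared latent space, can be illustrated with a short sketch. This is a minimal, hypothetical PyTorch example and not the authors' implementation: the module names, dimensions, and the symmetric contrastive loss are assumptions chosen for illustration.

```python
# Minimal sketch of cross-modal alignment in a joint latent space.
# All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSpaceAligner(nn.Module):
    def __init__(self, video_dim=1024, text_dim=300, latent_dim=512):
        super().__init__()
        # Project each modality into a shared latent space.
        self.video_proj = nn.Linear(video_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)

    def forward(self, video_feats, text_feats):
        # video_feats: (batch, video_dim); text_feats: (batch, text_dim)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def alignment_loss(v, t, temperature=0.07):
    # Symmetric contrastive loss: matching video/text pairs are pulled
    # together in the joint space, mismatched pairs are pushed apart.
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```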

Cited by 61 publications (31 citation statements)
References 33 publications (45 reference statements)
“…The inception network and self-attention networks were optimized jointly with clip-level feature learning and sequence learning. On the other hand, in [7], a cross-modal approach was proposed for RGB-based CSLR. The extracted video and text representations were aligned into a joint latent space while a jointly trained decoder was employed.…”
Section: Related Work
confidence: 99%
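For readers unfamiliar with the pipeline sketched in this statement, the following hedged example shows what jointly optimizing a clip-level feature encoder with a self-attention sequence model might look like. The linear clip encoder merely stands in for an Inception-style backbone; all layer sizes and the gloss vocabulary are illustrative assumptions, not the cited paper's architecture.

```python
# Hypothetical sketch: clip-level features feed a self-attention
# sequence model, and both parts are trained jointly end to end.
import torch
import torch.nn as nn

class ClipAttentionCSLR(nn.Module):
    def __init__(self, clip_dim=1024, model_dim=512, num_glosses=1232):
        super().__init__()
        # Stand-in for an Inception-style clip feature extractor.
        self.clip_encoder = nn.Linear(clip_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=8,
                                           batch_first=True)
        self.sequence_model = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(model_dim, num_glosses + 1)  # +1 for CTC blank

    def forward(self, clips):
        # clips: (batch, num_clips, clip_dim)
        x = self.clip_encoder(clips)      # clip-level feature learning
        x = self.sequence_model(x)        # self-attention sequence learning
        return self.classifier(x)         # per-clip gloss logits
```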
“…This is mainly because sign languages feature thousands of signs, sometimes differing only by subtle changes in hand motion, shape, or position, and involving significant finger overlaps and occlusions [2]. SLR tasks are divided into Isolated Sign Language Recognition (ISLR) [3,4,5] and Continuous Sign Language Recognition (CSLR) [6,7,8]. The CSLR task focuses on recognizing sequences of glosses from videos without predefined annotation boundaries, and it is more challenging than ISLR [9], in which the temporal boundaries of glosses in the videos are predefined.…”
Section: Introduction
confidence: 99%
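The property that makes CSLR weakly supervised, recognizing a gloss sequence without temporal boundary annotations, is commonly handled with Connectionist Temporal Classification (CTC). Below is a minimal sketch using PyTorch's CTC loss; the tensor shapes, sequence lengths, and vocabulary size are assumptions for illustration only.

```python
# Sketch of CSLR training without gloss boundaries via CTC.
import torch
import torch.nn.functional as F

T, B, C = 120, 4, 1233  # frames, batch size, gloss vocabulary incl. blank at index 0
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)  # per-frame gloss log-probs
targets = torch.randint(1, C, (B, 10))                # gloss label sequences
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)

# CTC marginalizes over all frame-to-gloss alignments, so the annotation
# needs no temporal boundaries -- only the ordered gloss sequence.
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
```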
“…Accurate hand pose estimation can enhance user experience in VR systems by enabling plausible and realistic virtual hand movements, and it contributes towards a better understanding of human actions in smart HCI systems [5-8], enabling more intelligent interaction between users and smart systems. Beyond these applications, hand pose estimation is crucial in a number of other tasks, such as gesture recognition [9,10], action recognition [11,12], support systems for patients with motor impairments [13], sign language recognition [14-18], and sign language representation [19]. In sign language recognition especially, accurate hand pose estimation is beneficial for promoting social inclusion and enhancing accessibility in the Deaf community.…”
Section: Introduction
confidence: 99%
“…As a result, recognition accuracy could be improved to a WER of . Lastly, Papastratis et al. [25] showed that the accuracy of sentence predictions could be further enhanced by a cross-modal learning approach that leverages text information.…”
Section: Introduction
confidence: 99%
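WER (Word Error Rate), the metric referenced in this statement (its numeric value did not survive extraction above), is the edit distance between the predicted and reference gloss sequences divided by the reference length. A self-contained sketch:

```python
# Word Error Rate: Levenshtein edit distance over words, normalized
# by the length of the reference sequence.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance table.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("MY NAME IS", "MY NAME WAS"))  # 0.333... (one substitution in three glosses)
```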