SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition

Hu, Hezhen; Zhao, Weichao; Zhou, Wengang; Wang, Yuechen; Li, Houqiang

doi:10.1109/iccv48922.2021.01090

Cited by 44 publications

(35 citation statements)

References 88 publications

(157 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…As shown in Table 5, we compare with the previous methods. SignBERT (Hu et al 2021a) Our method outperforms SignBERT (Hu et al 2021a) with 4.89%, 5.96% and 9.28% Top-1 per-instance accuracy improvement on MSASL100, MSASL200 and MSASL1000, respectively. Notably, our method even achieves comparable performance with RGB-based methods.…”

Section: Comparison With State-of-the-art Methodsmentioning

confidence: 79%

“…Method MSASL100 MSASL200 MSASL1000 P-I P-C P-I P-C P-I (Yan, Xiong, and Lin 2018) 59 (Yan, Xiong, and Lin 2018) 90.0 SignBERT (Hu et al 2021a) 94.5 Ours 95.4…”

Section: Ablation Studymentioning

confidence: 99%

“…As shown in Table2, ST-GCN(Yan, Xiong, and Lin 2018) shows inferior performance compared with RGB-based methods, which may be attributed to the failure of pose detection. Compared with ST-GCN (Yan, Xiong, and Lin 2018), self-supervised learning methods, i.e., SignBERT(Hu et al 2021a) and Ours, relieve this issue by leveraging the stored statistics during pre-training.…”

mentioning

confidence: 99%

See 2 more Smart Citations

BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization

Zhao¹,

Hu²,

Zhou³

et al. 2023

Preprint

View full text Add to dashboard Cite

In this work, we are dedicated to leveraging the BERT pretraining success and modeling the domain-specific statistics to fertilize the sign language recognition (SLR) model. Considering the dominance of hand and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone in a frame-wise manner. Pre-training is performed via reconstructing the masked triplet unit from the corrupted input sequence, which learns the hierarchical correlation context cues among internal and external triplet units. Notably, different from the highly semantic word token in BERT, the pose unit is a low-level signal originally located in continuous space, which prevents the direct adoption of the BERT cross-entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit. It adaptively extracts the discrete pseudo label from the pose triplet unit, which represents the semantic gesture/body state. After pre-training, we fine-tune the pretrained encoder on the downstream SLR task, jointly with the newly added task-specific layer. Extensive experiments are conducted to validate the effectiveness of our proposed method, achieving new state-of-the-art performance on all four benchmarks with a notable gain.

show abstract

Section: Comparison With State-of-the-art Methodsmentioning

confidence: 79%

“…Method MSASL100 MSASL200 MSASL1000 P-I P-C P-I P-C P-I (Yan, Xiong, and Lin 2018) 59 (Yan, Xiong, and Lin 2018) 90.0 SignBERT (Hu et al 2021a) 94.5 Ours 95.4…”

Section: Ablation Studymentioning

confidence: 99%

mentioning

confidence: 99%

See 1 more Smart Citation

BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization

Zhao¹,

Hu²,

Zhou³

et al. 2023

Preprint

View full text Add to dashboard Cite

show abstract

“…Sign language recognition methods can be roughly categorized into isolated sign language recognition [19,20,44] and continuous sign language recognition [5,7,34,35,38] (CSLR), and we focus on the latter in this paper. CSLR tries to translate image frames into corresponding glosses in a weakly-supervised way: only sentence-level label is provided.…”

Section: Continuous Sign Language Recognitionmentioning

confidence: 99%

Continuous Sign Language Recognition with Correlation Network

Hu¹,

Gao²,

Liu³

et al. 2023

Preprint

View full text Add to dashboard Cite

Human body trajectories are a salient cue to identify actions in the video. Such body trajectories are mainly conveyed by hands and face across consecutive frames in sign language. However, current methods in continuous sign language recognition (CSLR) usually process frames independently, thus failing to capture cross-frame trajectories to effectively identify a sign. To handle this limitation, we propose correlation network (CorrNet) to explicitly capture and leverage body trajectories across frames to identify signs. In specific, a correlation module is first proposed to dynamically compute correlation maps between the current frame and adjacent frames to identify trajectories of all spatial patches. An identification module is then presented to dynamically emphasize the body trajectories within these correlation maps. As a result, the generated features are able to gain an overview of local temporal movements to identify a sign. Thanks to its special attention on body trajectories, CorrNet achieves new state-of-the-art accuracy on four large-scale datasets, i.e., PHOENIX14, PHOENIX14-T, CSL-Daily, and CSL. A comprehensive comparison with previous spatial-temporal reasoning methods verifies the effectiveness of CorrNet. Visualizations demonstrate the effects of CorrNet on emphasizing human body trajectories across adjacent frames.

show abstract

“…In the domain of continuous sign language recognition, in which the objective is to infer a sequence of sign glosses, prior work has explored HMMs [3,36] in combination with Dynamic Time Warping (DTW) [73], RNNs [18] and architectures capable of learning effectively from CTC losses [15,75]. Recently, sign representation learning methods inspired by BERT [20] have shown the potential to learn effective representations for both isolated [24] and continuous [76] recognition. Koller [35] provides an extensive survey of the sign recognition literature, highlighting the extremely limited supply of datasets with large-scale vocabularies suitable for continuous sign language recognition.…”

Section: Related Workmentioning

confidence: 99%

Automatic dense annotation of large-vocabulary sign language videos

Momeni¹,

Bull²,

Prajwal³

et al. 2022

Preprint

View full text Add to dashboard Cite

Recently, sign language researchers have turned to sign language interpreted TV broadcasts, comprising (i) a video of continuous signing and (ii) subtitles corresponding to the audio content, as a readily available and large-scale source of training data. One key challenge in the usability of such data is the lack of sign annotations. Previous work exploiting such weakly-aligned data only found sparse correspondences between keywords in the subtitle and individual signs. In this work, we propose a simple, scalable framework to vastly increase the density of automatic annotations. Our contributions are the following:(1) we significantly improve previous annotation methods by making use of synonyms and subtitle-signing alignment; (2) we show the value of pseudo-labelling from a sign recognition model as a way of sign spotting;(3) we propose a novel approach for increasing our annotations of known and unknown classes based on in-domain exemplars; (4) on the BOBSL BSL sign language corpus, we increase the number of confident automatic annotations from 670K to 5M. We make these annotations publicly available to support the sign language research community.

show abstract

SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition

Cited by 44 publications

References 88 publications

BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization

BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization

Continuous Sign Language Recognition with Correlation Network

Automatic dense annotation of large-vocabulary sign language videos

Contact Info

Product

Resources

About