2022
DOI: 10.48550/arxiv.2203.07996
Preprint

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

Abstract: Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled multimodal data is rather costly, especially for audio-visual speech recognition (AVSR). It therefore makes sense to exploit unlabelled unimodal data. On the other hand, although the effectiveness of large-scale self-supervised learning is well established in both the audio and visual modalities, how to integrate those pretrained models into a multimodal scenario remains underexplored. In th…

Cited by 3 publications (10 citation statements)
References 22 publications
“…In Ref. [117] (No. 9, Table 3), the authors achieved 85.00% lip-reading accuracy by making use of unlabelled unimodal data.…”
Section: LRW Methodology: Visual Speech Recognition
Mentioning confidence: 99%
“…The CTC loss assumes conditional independence between each output prediction and has a form of [117] p_CTC(y…”
Section: LRW Methodology: Visual Speech Recognition
Mentioning confidence: 99%
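
For context, the quote above refers to the standard CTC factorization. A minimal sketch in the usual notation (the symbols here are assumed for illustration, not copied from the cited paper):

p_{\mathrm{CTC}}(\mathbf{y} \mid \mathbf{x}) \;=\; \sum_{\boldsymbol{\pi} \in \mathcal{B}^{-1}(\mathbf{y})} \prod_{t=1}^{T} p(\pi_t \mid \mathbf{x})

where \mathcal{B} collapses repeated symbols and removes blanks, so the sum runs over all frame-level alignments \boldsymbol{\pi} of the label sequence \mathbf{y}, and each per-frame prediction \pi_t is treated as conditionally independent given the input \mathbf{x}.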
“…The visual frontend serves as a component that captures lip motion and reflects lip-position differences in its output representations. Here, we have followed the same procedure as Xichen Pan [21]: we truncated the first convolutional layer of MoCo v2, which was pre-trained on ImageNet, and replaced it with a 3D convolutional layer.…”
Section: Architecture and Methods
Mentioning confidence: 99%
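
To make the frontend modification concrete, below is a hypothetical PyTorch sketch of truncating the 2D input stem of a ResNet-50 backbone (the architecture used by MoCo v2) and replacing it with a 3D convolutional stem. The class name, kernel sizes, input shapes, and weight handling are assumptions for illustration, not taken from either paper; in practice the trunk would be initialized from a MoCo v2 checkpoint rather than left random.

# Hypothetical sketch: swap the 2D stem of a ResNet-50 backbone for a 3D stem.
import torch
import torch.nn as nn
import torchvision

class VisualFrontend(nn.Module):
    def __init__(self):
        super().__init__()
        # 3D stem over greyscale lip crops: (B, 1, T, H, W) -> (B, 64, T, H/4, W/4)
        self.stem3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # ResNet-50 trunk; load MoCo v2 ImageNet-pretrained weights here in practice.
        resnet = torchvision.models.resnet50(weights=None)
        # Drop the original conv1/bn1/relu/maxpool stem and the classifier head,
        # keeping layer1..layer4 and the global average pool.
        self.trunk = nn.Sequential(*list(resnet.children())[4:-1])

    def forward(self, x):                             # x: (B, 1, T, H, W)
        b, _, t, _, _ = x.shape
        feats = self.stem3d(x)                        # (B, 64, T, H', W')
        feats = feats.transpose(1, 2).flatten(0, 1)   # (B*T, 64, H', W')
        feats = self.trunk(feats)                     # (B*T, 2048, 1, 1)
        return feats.flatten(1).view(b, t, -1)        # (B, T, 2048) per-frame features

frontend = VisualFrontend()
out = frontend(torch.randn(2, 1, 16, 112, 112))
print(out.shape)  # torch.Size([2, 16, 2048])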
“…For audio-only methods, we used the same LU-SSL transformer proposed by Xichen Pan et al. [21] in 2022, so the WER is consistent with that method: an error rate of only 2.7%, which is currently the best achieved on the LRS2 dataset.…”
Section: Methods
Mentioning confidence: 99%