2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00510

Sub-word Level Lip Reading With Visual Attention

Cited by 51 publications (36 citation statements). References 47 publications.
“…Random cropping with size 88×88 and horizontal flipping are also performed for each video during training. We also follow Prajwal et al. [37] in using a central crop with horizontal flipping at test time for visual-only experiments.…”
Section: Implementation Details (mentioning)
confidence: 99%
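The cropping and flipping scheme quoted above can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration assuming torchvision-style transforms applied to (T, C, H, W) mouth-crop tensors; it is not the cited paper's implementation, and the function name augment_clip is an assumption for this example.

```python
# Minimal sketch of the augmentation described in the citation above,
# assuming torchvision transforms and clips stored as (T, C, H, W) tensors.
import torch
from torchvision import transforms

# Training: random 88x88 crop plus random horizontal flip. Applying the
# transform to the whole clip tensor keeps the crop/flip consistent
# across all frames of the video.
train_transform = transforms.Compose([
    transforms.RandomCrop(88),
    transforms.RandomHorizontalFlip(p=0.5),
])

# Test time (visual-only): central 88x88 crop; the quoted work also uses
# horizontal flipping at test time (e.g. averaging over the flipped copy).
test_transform = transforms.CenterCrop(88)


def augment_clip(clip: torch.Tensor, train: bool) -> torch.Tensor:
    """Apply the spatial transform to a clip of mouth-region frames.

    clip: float tensor of shape (T, C, H, W); H and W are assumed to be
    at least 88 (e.g. 96x96 mouth crops).
    """
    transform = train_transform if train else test_transform
    return transform(clip)
```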
“…Other works focus on Visual Speech Recognition (VSR), using only lip movements to transcribe spoken language into text [4,9,48,3,49,37,30]. An important line of research is the use of cross-modal distillation.…”
(mentioning)
confidence: 99%
“…Audio and visual speech are two separate modalities that convey speech content. Numerous works [42,12,1,2,44,24,26] have explored ways to extract information from speech using these modalities. Speech recognition [42,6,21] is widely used in online meetings and social applications to recognize speech content.…”
Section: Related Work 2.1 Audio-visual Speech (mentioning)
confidence: 99%
“…Keyword spotting [5,49,28] is employed in short-video applications to quickly retrieve relevant content. Additionally, in noisy scenarios, related speech tasks [13,20,44,39] rely on visual speech to avoid interference from surrounding speech and background noise. Despite the growing interest in speech tasks that rely on visual speech, research [54,57] on visual speech translation is limited and lacks validation due to the lack of multilingual audio-visual speech transcription datasets.…”
Section: Related Work 2.1 Audio-visual Speech (mentioning)
confidence: 99%