2019
DOI: 10.48550/arxiv.1911.11502
Preprint

Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

Abstract: Lip reading has witnessed unparalleled development in recent years thanks to deep learning and the availability of large-scale datasets. Despite the encouraging results achieved, the performance of lip reading unfortunately remains inferior to that of its counterpart, speech recognition, due to the ambiguous nature of its actuations, which makes it challenging to extract discriminative features from lip movement videos. In this paper, we propose a new method, termed Lip by Speech (LIBS), of which the goa…
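The abstract describes distilling a speech recognizer (teacher) into a lip reader (student). As a rough illustration only — not the paper's actual multi-scale method — the generic knowledge-distillation objective it builds on combines a hard-label cross-entropy with a temperature-softened KL term against the teacher's posteriors. A minimal numpy sketch, with the temperature `T` and mixing weight `alpha` as assumed hyperparameters:

```python
import numpy as np

def softmax(z, T):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hard-label cross-entropy plus KL(teacher || student) at temperature T.

    student_logits, teacher_logits: (batch, classes); labels: (batch,) int class ids.
    The T**2 factor keeps the soft-term gradient scale comparable across temperatures.
    """
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1).mean()
    log_p = np.log(softmax(student_logits, 1.0) + 1e-12)
    ce = -log_p[np.arange(len(labels)), labels].mean()
    return alpha * ce + (1 - alpha) * (T ** 2) * kl
```

When teacher and student logits agree, the KL term vanishes and only the hard-label loss remains; a mismatched teacher strictly increases the loss.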

Cited by 3 publications (8 citation statements)
References 16 publications
“…Note that experiment on GRID dataset needs more training steps, since it is trained with its visual frontend together from scratch, different from experiments on LRS2 dataset. Moreover, the first 45k steps in warm-up stage for LRS2 are trained on LRS2-pretrain sub-dataset and all the left steps are trained on LRS2-main sub-dataset [1,2,33].…”
Section: Training Setup
confidence: 99%
“…The former surpasses the performance of all previous work on LRS2-BBC dataset by a large margin. To boost the performance of lipreading, Petridis et al [19] present a hybrid CTC/Attention architecture aiming to obtain the better alignment than attention-only mechanism, Zhao et al [33] provide the idea that transferring knowledge from audio-speech recognition model to lipreading model by distillation.…”
Section: Related Work 2.1 Autoregressive Deep Lipreading
confidence: 99%