2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9004036
Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

Cited by 84 publications (65 citation statements). References 14 publications.
“…Training. For training, we use over 50k hours of transcribed short YouTube video segments extracted with the semi-supervised procedure originally proposed in [17] and extended in [1,18] to include video. We extract short segments where the force-aligned user-uploaded transcription matches the transcription from a production-quality ASR system.…”
Section: Datasets
confidence: 99%
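The semi-supervised selection procedure quoted above keeps only segments where the force-aligned, user-uploaded transcript agrees with the output of a production-quality ASR system. A minimal sketch of that filtering step, assuming a simple `Segment` record and exact-match-after-normalization as the agreement criterion (both are illustrative assumptions, not the paper's actual pipeline):

```python
# Hypothetical sketch of the transcript-agreement filter described in the
# excerpt above. The Segment type, field names, and normalization rules are
# illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float          # segment start time (seconds)
    end: float            # segment end time (seconds)
    user_transcript: str  # force-aligned, user-uploaded transcript
    asr_transcript: str   # production-quality ASR hypothesis

def normalize(text: str) -> list[str]:
    """Case-fold, drop punctuation, and tokenize so cosmetic differences
    (casing, commas) do not discard an otherwise agreeing segment."""
    kept = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return kept.split()

def select_training_segments(segments: list[Segment]) -> list[Segment]:
    """Keep only segments whose two transcripts agree after normalization."""
    return [s for s in segments
            if normalize(s.user_transcript) == normalize(s.asr_transcript)]

segments = [
    Segment(0.0, 2.1, "Hello, world!", "hello world"),  # agrees -> kept
    Segment(2.1, 4.0, "foo bar", "foo baz"),            # disagrees -> dropped
]
kept = select_training_segments(segments)
```

Exact string match is a deliberately strict criterion: at the scale of YouTube-sized corpora it trades recall for label quality, which matches the excerpt's goal of harvesting reliably transcribed training data.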
“…Up until recently, audio-visual ASR had been studied only under the ideal scenario where the video of a single speaking face matches the audio [1,2,3,4,5,6]. However, at inference time when multiple people are simultaneously visible on screen we need to decide which face to feed to the model, and this issue of active speaker selection has traditionally been considered a separate problem [7,8].…”
Section: Introduction
confidence: 99%
“…Extensive studies of audio-visual speech recognition technologies have been conducted in recent years and have demonstrated their efficacy in improving speech recognition performance under both clean and adverse conditions [32], [35], [52]-[57]. Following [49], in this work, the convolutional long short-term memory fully connected neural network (CLDNN) [58] is adopted as the recognition back-end system architecture.…”
Section: A. Audio-Visual Speech Recognition Back-End
confidence: 99%
“…Several audio-visual approaches have recently been presented in which pre-computed visual or audio features are used [1,19,25,29,34]. Afouras et al. developed a transformer-based sequence-to-sequence model using pre-computed visual features and log-Mel filter-bank features as inputs.…”
Section: Introduction
confidence: 99%