2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru51503.2021.9687874
Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

Cited by 37 publications (21 citation statements)
References 25 publications
“…The back-end networks use an Efficient Conformer architecture. The Efficient Conformer encoder was proposed in [7]; it is composed of several stages, each comprising a number of Conformer blocks [16] that use grouped attention with relative positional encodings. The temporal sequence is progressively downsampled with strided convolutions and projected to wider feature dimensions, lowering the amount of computation while achieving better performance.…”
Section: Model Architecture
confidence: 99%
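The statement above describes downsampling between encoder stages while widening the feature dimension. A minimal sketch of that idea is shown below; it is not the authors' code, and the dimensions (d_in=256, d_out=360) and stride are illustrative assumptions.

```python
# Sketch: progressive downsampling between encoder stages via a strided 1-D
# convolution that halves the temporal length and projects to a wider dim.
import torch
import torch.nn as nn

class DownsampleProjection(nn.Module):
    def __init__(self, d_in: int = 256, d_out: int = 360, stride: int = 2):
        super().__init__()
        # A plain strided Conv1d stands in for the paper's downsampling conv.
        self.conv = nn.Conv1d(d_in, d_out, kernel_size=3, stride=stride, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_in) -> (batch, time // stride, d_out)
        x = x.transpose(1, 2)      # (batch, d_in, time)
        x = self.conv(x)           # (batch, d_out, time // stride)
        return x.transpose(1, 2)

# Example: 100 frames at dim 256 become 50 frames at dim 360.
frames = torch.randn(4, 100, 256)
print(DownsampleProjection()(frames).shape)  # torch.Size([4, 50, 360])
```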
“…The Efficient Conformer [7] proposed replacing Multi-Head Self-Attention (MHSA) [44] in earlier encoder layers with grouped attention. Grouped MHSA reduces attention complexity by grouping neighbouring temporal elements along the feature dimension before applying scaled dot-product attention.…”
Section: Patch Attention
confidence: 99%
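The grouping step described above can be illustrated as follows. This is a simplified assumption-based sketch (group size, dimensions, and the use of a standard nn.MultiheadAttention are illustrative, and the relative positional encodings of the actual model are omitted): neighbouring frames are concatenated along the feature dimension before attention and the output is reshaped back afterwards.

```python
# Sketch: grouped self-attention over a sequence shortened by a factor g.
import torch
import torch.nn as nn

def grouped_self_attention(x: torch.Tensor, mha: nn.MultiheadAttention, g: int = 2):
    # x: (batch, time, dim); time is assumed divisible by g for simplicity.
    b, t, d = x.shape
    grouped = x.reshape(b, t // g, g * d)    # concat g neighbours along features
    out, _ = mha(grouped, grouped, grouped)  # attention over t / g positions
    return out.reshape(b, t, d)              # restore the original resolution

dim, g = 144, 2
mha = nn.MultiheadAttention(embed_dim=g * dim, num_heads=4, batch_first=True)
x = torch.randn(2, 64, dim)
print(grouped_self_attention(x, mha, g).shape)  # torch.Size([2, 64, 144])
```

Because attention cost grows quadratically with sequence length, running it on t / g positions reduces that term by roughly g².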
“…It uses a convolution module to capture local context dependencies in addition to the long context captured by the self-attention module. The conformer architecture was investigated for different end-to-end systems such as attention encoder-decoder models [12,13] and the recurrent neural network transducer [10,14]. Nevertheless, there has been no work investigating the impact of using a conformer AM for hybrid ASR systems.…”
Section: Introduction and Related Work
confidence: 99%
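The convolution module mentioned in this statement is the component of the Conformer block [16] that models local context alongside self-attention. The sketch below is an illustrative reconstruction under common assumptions (dim=256, depthwise kernel size 31), not code from any of the cited systems.

```python
# Sketch: a Conformer-style convolution module with a residual connection:
# pointwise conv + GLU, depthwise conv for local context, norm, Swish, pointwise conv.
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    def __init__(self, dim: int = 256, kernel_size: int = 31):
        super().__init__()
        self.pointwise_in = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()  # Swish
        self.pointwise_out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); the depthwise conv captures local dependencies.
        y = x.transpose(1, 2)
        y = self.glu(self.pointwise_in(y))
        y = self.act(self.norm(self.depthwise(y)))
        y = self.pointwise_out(y).transpose(1, 2)
        return x + y  # residual around the module

print(ConvModule()(torch.randn(2, 50, 256)).shape)  # torch.Size([2, 50, 256])
```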