2022
DOI: 10.1109/taslp.2021.3138707

Alleviating ASR Long-Tailed Problem by Decoupling the Learning of Representation and Classification







Cited by 13 publications (7 citation statements)
References 40 publications
“…Previous work [4,13,12,14] focused on the encoder-decoder E2E ASR structure rather than the CTC-based structure to estimate the internal LM, because CTC-based models are generally not considered capable of modelling context between output tokens due to the conditional independence assumption. However, CTC-based E2E ASR models learn the training data distribution and are affected by the frequency of words in the training data [23]. The CTC-based model therefore has at least the modelling ability of a uni-gram LM, and this paper aims to adapt it to the target domain effectively at inference time, without re-training.…”
Section: Residual Softmax (R-softmax)
Mentioning, confidence: 99%
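The statement above treats the CTC model's implicitly learned uni-gram prior as something that can be corrected at decode time. Below is a minimal sketch of that general idea, not the cited paper's exact R-softmax formulation: the function name, the assumption that the blank token is at index 0, and the interpolation weight are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def adapt_ctc_log_probs(logits, src_unigram, tgt_unigram, weight=0.3, eps=1e-8):
    """Shift CTC output distributions away from the source-domain unigram prior
    and toward a target-domain prior at decode time, with no re-training.

    logits:      (T, V) raw CTC logits for one utterance
    src_unigram: (V,) token counts estimated from the training transcripts
    tgt_unigram: (V,) token counts estimated from target-domain text
    weight:      interpolation weight for the prior correction (assumed)
    """
    log_probs = F.log_softmax(logits, dim=-1)                 # log p(token | frame)
    src_prior = src_unigram.float() / src_unigram.sum()
    tgt_prior = tgt_unigram.float() / tgt_unigram.sum()
    prior_shift = torch.log(tgt_prior + eps) - torch.log(src_prior + eps)
    prior_shift[0] = 0.0      # assume index 0 is the CTC blank; leave it untouched
    return log_probs + weight * prior_shift                   # scores for decoding
```

The adjusted scores can then be passed to ordinary greedy or beam-search CTC decoding; only the text priors, not the acoustic model, change when moving to a new domain.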
“…Secondly, to overcome the problem of uneven data distribution in multilingual speech recognition tasks, Winata et al. [45] attempted to improve the recognition rate of multilingual speech recognition by pre-training language models and using class priors to adjust the output of the softmax function. To alleviate the long-tail problem within a single language in speech recognition, Deng et al. [46] used a two-step training approach, i.e., representation learning followed by classification learning, in an end-to-end speech recognition model to improve the recognition of low-frequency words, adding multiple loss functions (for example, a softmax loss function with temperature in the Transformer decoder) and pre-training the language model. Previous studies have not explored the long-tail problem in speech recognition for a single low-resource language.…”
Section: Related Work
Mentioning, confidence: 99%
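Where the cited works adjust the softmax output with class priors, a generic, hedged sketch of prior-adjusted (logit-adjusted) training looks roughly like the following. The exact formulations used by Winata et al. [45] and Deng et al. [46] may differ; the function name and the tau hyper-parameter are assumptions.

```python
import torch
import torch.nn.functional as F

def prior_adjusted_cross_entropy(logits, targets, class_counts, tau=1.0):
    """Cross-entropy with a class-prior (logit) adjustment for long-tailed data.

    logits:       (N, C) classifier outputs
    targets:      (N,) gold class indices
    class_counts: (C,) token/class frequencies from the training set
    tau:          strength of the prior adjustment (assumed hyper-parameter)
    """
    prior = class_counts.float() / class_counts.sum()
    adjusted = logits + tau * torch.log(prior + 1e-12)   # add log-prior to the logits
    return F.cross_entropy(adjusted, targets)
```

During training this penalizes head classes in proportion to their frequency; at inference the plain logits (without the added log-prior) are used, so rare tokens are no longer systematically under-scored.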
“…Inspired by ref. [46], this article modifies the output of the softmax function in the Conformer model and mitigates the problem of uneven data distribution by adding a penalty factor to the softmax classifier in the attention-based model structure. The penalty factor is similar to the temperature in knowledge distillation [48].…”
Section: Balanced Softmax
Mentioning, confidence: 99%
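The penalty factor described here plays the same role as a distillation temperature applied inside the softmax. A minimal sketch follows; the division-by-penalty form is an assumption about how the factor enters the classifier, and the cited article tunes its own value.

```python
import torch.nn.functional as F

def penalized_softmax(logits, penalty=2.0):
    """Softmax with a penalty factor acting like a distillation temperature.

    penalty > 1 flattens the output distribution so that low-frequency
    classes are not crushed by the dominant high-frequency ones.

    logits:  (N, C) outputs of the attention/Conformer classifier layer
    penalty: assumed scalar hyper-parameter
    """
    return F.softmax(logits / penalty, dim=-1)
```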
“…We choose to incorporate BERT [19] into our ASR system due to its powerful text-processing capabilities, enabled by its embedding layer and multi-layer Transformer encoder [19,20]. As shown in Fig.…”
Section: Modality Conversion Mechanism and BERT
Mentioning, confidence: 99%
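For context, this is a minimal sketch of obtaining such BERT text representations with the Hugging Face transformers library. How the cited system actually wires BERT into its modality conversion mechanism is not shown in this excerpt; the checkpoint name, the example input, and the choice of output states are assumptions.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load the embedding layer plus multi-layer Transformer encoder the statement
# refers to (checkpoint name is an assumption).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

hypothesis = "the quick brown fox"            # e.g. a text sequence to encode
inputs = tokenizer(hypothesis, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
token_states = outputs.last_hidden_state      # (1, seq_len, 768) contextual embeddings
```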