2022
DOI: 10.1109/taslp.2021.3138707

Alleviating ASR Long-Tailed Problem by Decoupling the Learning of Representation and Classification







Cited by 13 publications (7 citation statements)
References 40 publications
“…Previous work [4,13,12,14] focused on the encoder-decoder E2E ASR structure rather than the CTC-based structure to estimate the internal LM, because CTC-based models are generally not considered capable of modelling context between output tokens due to the conditional independence assumption. However, CTC-based E2E ASR models learn the training data distribution and are affected by the frequency of words in the training data [23]. The CTC-based model therefore has at least the modelling ability of a uni-gram LM, and this paper aims to adapt it to the target domain effectively at inference time, without re-training.…”
Section: Residual Softmax (R-softmax)
Mentioning, confidence: 99%
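The statement above treats the CTC model's implicitly learned uni-gram prior as something that can be corrected at decode time. Below is a minimal sketch of that general idea, not the cited paper's exact R-softmax formulation: the function name, the assumption that the blank token is at index 0, and the interpolation weight are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def adapt_ctc_log_probs(logits, src_unigram, tgt_unigram, weight=0.3, eps=1e-8):
    """Shift CTC output distributions away from the source-domain unigram prior
    and toward a target-domain prior at decode time, with no re-training.

    logits:      (T, V) raw CTC logits for one utterance
    src_unigram: (V,) token counts estimated from the training transcripts
    tgt_unigram: (V,) token counts estimated from target-domain text
    weight:      interpolation weight for the prior correction (assumed)
    """
    log_probs = F.log_softmax(logits, dim=-1)                 # log p(token | frame)
    src_prior = src_unigram.float() / src_unigram.sum()
    tgt_prior = tgt_unigram.float() / tgt_unigram.sum()
    prior_shift = torch.log(tgt_prior + eps) - torch.log(src_prior + eps)
    prior_shift[0] = 0.0      # assume index 0 is the CTC blank; leave it untouched
    return log_probs + weight * prior_shift                   # scores for decoding
```

The adjusted scores can then be passed to ordinary greedy or beam-search CTC decoding; only the text priors, not the acoustic model, change when moving to a new domain.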
“…Secondly, to overcome the problem of uneven data distribution in multilingual speech recognition tasks, Winata et al. [45] attempted to improve the recognition rate of multilingual speech recognition by pre-training language models and using class priors to adjust the output of the softmax function. To alleviate the long-tail problem within a single language in speech recognition, Deng et al. [46] used a two-step training approach, i.e., representation learning followed by classification learning, in an end-to-end speech recognition model to improve the recognition of low-frequency words, adding multiple loss functions (for example, a softmax loss function with temperature in the Transformer decoder) and pre-training the language model. Previous studies have not explored the long-tail problem in speech recognition for a single low-resource language.…”
Section: Related Work
Mentioning, confidence: 99%
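Where the cited works adjust the softmax output with class priors, a generic, hedged sketch of prior-adjusted (logit-adjusted) training looks roughly like the following. The exact formulations used by Winata et al. [45] and Deng et al. [46] may differ; the function name and the tau hyper-parameter are assumptions.

```python
import torch
import torch.nn.functional as F

def prior_adjusted_cross_entropy(logits, targets, class_counts, tau=1.0):
    """Cross-entropy with a class-prior (logit) adjustment for long-tailed data.

    logits:       (N, C) classifier outputs
    targets:      (N,) gold class indices
    class_counts: (C,) token/class frequencies from the training set
    tau:          strength of the prior adjustment (assumed hyper-parameter)
    """
    prior = class_counts.float() / class_counts.sum()
    adjusted = logits + tau * torch.log(prior + 1e-12)   # add log-prior to the logits
    return F.cross_entropy(adjusted, targets)
```

During training this penalizes head classes in proportion to their frequency; at inference the plain logits (without the added log-prior) are used, so rare tokens are no longer systematically under-scored.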
“…Inspired by ref. [46], this article modifies the output of the softmax function in the Conformer model and mitigates the problem of uneven data distribution by adding a penalty factor to the softmax classifier in the attention-based model structure. The penalty factor is similar to the temperature in knowledge distillation [48].…”
Section: Balanced Softmax
Mentioning, confidence: 99%
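The penalty factor described here plays the same role as a distillation temperature applied inside the softmax. A minimal sketch follows; the division-by-penalty form is an assumption about how the factor enters the classifier, and the cited article tunes its own value.

```python
import torch.nn.functional as F

def penalized_softmax(logits, penalty=2.0):
    """Softmax with a penalty factor acting like a distillation temperature.

    penalty > 1 flattens the output distribution so that low-frequency
    classes are not crushed by the dominant high-frequency ones.

    logits:  (N, C) outputs of the attention/Conformer classifier layer
    penalty: assumed scalar hyper-parameter
    """
    return F.softmax(logits / penalty, dim=-1)
```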
“…We choose to incorporate BERT [19] into our ASR system due to its powerful text-processing capabilities, enabled by its embedding layer and multi-layer Transformer encoder [19,20]. As shown in Fig.…”
Section: Modality Conversion Mechanism and BERT
Mentioning, confidence: 99%
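For context, this is a minimal sketch of obtaining such BERT text representations with the Hugging Face transformers library. How the cited system actually wires BERT into its modality conversion mechanism is not shown in this excerpt; the checkpoint name, the example input, and the choice of output states are assumptions.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load the embedding layer plus multi-layer Transformer encoder the statement
# refers to (checkpoint name is an assumption).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

hypothesis = "the quick brown fox"            # e.g. a text sequence to encode
inputs = tokenizer(hypothesis, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
token_states = outputs.last_hidden_state      # (1, seq_len, 768) contextual embeddings
```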