Interspeech 2020
DOI: 10.21437/interspeech.2020-2215
Confidence Measures in Encoder-Decoder Models for Speech Recognition

Cited by 14 publications (21 citation statements)
References 0 publications
“…A good model shows a lower EER value and higher AUC/NCE values. The details of these metrics can be found in [18,26,27,40].…”
Section: Experimental Settings
confidence: 99%
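The EER, AUC, and NCE metrics mentioned above all score a confidence estimator against binary correct/incorrect token labels. A minimal sketch of all three, assuming confidence scores in [0, 1] without ties (tie handling and threshold interpolation are simplified here):

```python
import numpy as np

def auc_roc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = labels.sum(), len(labels) - labels.sum()
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def nce(scores, labels):
    """Normalized cross entropy: share of label entropy removed by the scores."""
    scores = np.clip(np.asarray(scores, float), 1e-12, 1 - 1e-12)
    labels = np.asarray(labels, float)
    h_cond = -np.mean(labels * np.log2(scores)
                      + (1 - labels) * np.log2(1 - scores))
    p = labels.mean()
    h_prior = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return (h_prior - h_cond) / h_prior

def eer(scores, labels):
    """Equal error rate: operating point where false-accept rate
    (wrong tokens accepted) meets false-reject rate (correct tokens rejected)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    best = 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)
        frr = np.mean(scores[labels == 1] < t)
        best = min(best, max(far, frr))
    return best
```

A perfect estimator gives AUC = 1, EER = 0, and NCE approaching 1; an uninformative one gives AUC ≈ 0.5, EER ≈ 0.5, and NCE ≤ 0.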
“…In [22], a lightweight CEM that uses internal features of a seq2seq model was proposed to mitigate overconfidence. In [23], softmax temperature values for each token were predicted to adjust overconfident probabilities. In the CTC-based ASR models used in this study, confidence scores can be obtained with the forward-backward algorithm [35], which was reported to perform well [24].…”
Section: Confidence Estimation
confidence: 99%
“…The proposed rescoring method is closely related to confidence estimation, or the ASR error detection task. Confidence estimation assesses the quality of ASR predictions [18,19,20,21,22,23,24,25], which is useful for many downstream ASR applications such as voice assistants. We demonstrate that our models for rescoring can be applied to confidence estimation without any additional architectural changes or training.…”
Section: Introduction
confidence: 99%
“…Different network structures like feed-forward network (FFN) [22,23], recurrent neural network (RNN) [14,16] and self-attention Transformer [6] could be applied to realize the NCM module. In this study, a residual FFN with three hidden layers is adopted as the classification model.…”
Section: Predictor Features
confidence: 99%
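The residual FFN described above maps per-token predictor features to a single correctness probability. A numpy forward-pass sketch of that architecture, where the hidden width, ReLU activation, and initialization are assumptions (the cited study does not specify them here):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(dim_in, dim_out):
    """He-initialized weight matrix and zero bias (illustrative values)."""
    return rng.normal(0, np.sqrt(2 / dim_in), (dim_in, dim_out)), np.zeros(dim_out)

class ResidualFFNNCM:
    """Residual feed-forward binary classifier for neural confidence
    measure (NCM) estimation: three hidden layers with skip connections."""

    def __init__(self, dim_in, hidden=256):
        self.proj = layer(dim_in, hidden)       # project features to hidden width
        self.blocks = [layer(hidden, hidden) for _ in range(3)]
        self.out = layer(hidden, 1)             # single logit: P(token correct)

    def __call__(self, x):
        w, b = self.proj
        h = np.maximum(x @ w + b, 0)            # ReLU
        for w, b in self.blocks:
            h = h + np.maximum(h @ w + b, 0)    # residual connection
        w, b = self.out
        logit = h @ w + b
        return 1 / (1 + np.exp(-logit))         # sigmoid -> confidence in (0, 1)
```

In practice such a model would be trained with binary cross-entropy against per-token correct/incorrect labels obtained by aligning hypotheses to references.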
“…However, the softmax probability was found to be unreliable and might perform poorly due to the overconfident behaviour of E2E models [21,22]. To alleviate the problem of unreliability, a neural network can be trained independently to predict a softmax temperature value to re-distribute the original output probabilities at each time step of decoding [23]. In [22], a lightweight neural network was used to estimate neural confidence measure (NCM), which was shown to be more reliable than directly using the Fig.…”
Section: Introduction
confidence: 99%
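The temperature re-distribution mentioned above is a one-line change to the softmax: dividing the logits by a temperature T > 1 flattens an overconfident distribution without changing the argmax. A minimal sketch with illustrative logit values (in [23] the temperature is predicted per decoding step rather than fixed):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; T > 1 flattens overconfident distributions."""
    z = np.asarray(logits, float) / temperature
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([8.0, 2.0, 1.0, 0.5])   # hypothetical token logits
p_raw = softmax(logits)                   # near one-hot: overconfident
p_cal = softmax(logits, temperature=4.0)  # flatter, better-calibrated scores
```

The re-scaled top probability is a more honest confidence score, while the ranking of tokens is preserved.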