Optimizing Expected Word Error Rate via Sampling for Speech Recognition

Shannon, Matt

doi:10.21437/interspeech.2017-639

Cited by 42 publications

(36 citation statements)

References 25 publications

(37 reference statements)

“…Computing the loss in (4) exactly is intractable since it involves a summation over all possible label sequences. We therefore consider two possible approximations which ensure tractability: approximating the expectation in (4) with samples [3,15], or restricting the summation to an N-best list as is commonly done during sequence training for ASR [19].…”

Section: Minimum Word Error Rate Training Of Attention-based Models (mentioning)

confidence: 99%

“…We can approximate the expectation in (4) using an empirical average over samples drawn from the model [15]:…”

Section: Approximation By Sampling (mentioning)

confidence: 99%

“…In the present work, we consider techniques to optimize attention-based sequence-to-sequence models in order to directly minimize WER. Our proposed approach is similar to [14,15] in that we approximate the expected WER using hypotheses from the model. We consider both the use of sampling-based approaches [14,15] as well as approximating the loss over N-best lists of recognition hypotheses as is commonly done in ASR (e.g., [19]).…”

Section: Introduction (mentioning)

confidence: 99%

“…Our proposed approach is similar to [14,15] in that we approximate the expected WER using hypotheses from the model. We consider both the use of sampling-based approaches [14,15] as well as approximating the loss over N-best lists of recognition hypotheses as is commonly done in ASR (e.g., [19]). However, unlike Sak et al [3] we find that the process is more effective if we approximate the expectation using N-best hypotheses decoded from the model using beam-search [20] rather than sampling from the model (See section 5.1).…”

Section: Introduction (mentioning)

confidence: 99%

Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models

Prabhavalkar, Sainath, et al. 2018

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Sequence-to-sequence models, such as attention-based models in automatic speech recognition (ASR), are typically trained to optimize the cross-entropy criterion, which corresponds to improving the log-likelihood of the data. However, system performance is usually measured in terms of word error rate (WER), not log-likelihood. Traditional ASR systems benefit from discriminative sequence training which optimizes criteria such as the state-level minimum Bayes risk (sMBR) which are more closely related to WER.

In the present work, we explore techniques to train attention-based models to directly minimize expected word error rate. We consider two loss functions which approximate the expected number of word errors: either by sampling from the model, or by using N-best lists of decoded hypotheses, which we find to be more effective than the sampling-based method. In experimental evaluations, we find that the proposed training procedure improves performance by up to 8.2% relative to the baseline system. This allows us to train grapheme-based, uni-directional attention-based models which match the performance of a traditional, state-of-the-art, discriminative sequence-trained system on a mobile voice-search task.

Index Terms: sequence-to-sequence models, attention models, minimum word error rate training, minimum Bayes risk
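The two approximations named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy hypotheses, and their probabilities are all hypothetical, and a three-way categorical distribution stands in for a real sequence model. It only shows the difference between averaging word errors over samples drawn from a distribution and taking a probability-weighted sum over an N-best list with renormalized weights.

```python
import math
import random


def word_errors(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,             # drop the hypothesis word
                            curr[j - 1] + 1,         # add the missing reference word
                            prev[j - 1] + (h != r))) # match or substitute
        prev = curr
    return prev[-1]


def expected_errors_nbest(hyps, log_probs, ref):
    """Weighted sum of word errors, with probabilities renormalized over the list."""
    m = max(log_probs)  # subtract the max for numerical stability
    weights = [math.exp(lp - m) for lp in log_probs]
    total = sum(weights)
    return sum(w / total * word_errors(h, ref) for h, w in zip(hyps, weights))


def expected_errors_sampled(hyps, log_probs, ref, num_samples, rng):
    """Monte Carlo estimate: mean word errors over samples from the toy distribution."""
    weights = [math.exp(lp) for lp in log_probs]
    samples = rng.choices(hyps, weights=weights, k=num_samples)
    return sum(word_errors(s, ref) for s in samples) / num_samples


# Hypothetical toy data: a reference and three scored hypotheses.
ref = "play some jazz music".split()
hyps = [
    "play some jazz music".split(),
    "play sum jazz music".split(),
    "play jazz music".split(),
]
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]

nbest_value = expected_errors_nbest(hyps, log_probs, ref)
sampled_value = expected_errors_sampled(hyps, log_probs, ref, 10000, random.Random(0))
```

In this toy case the N-best list covers the whole distribution, so the two estimates agree up to sampling noise. With a real model the N-best list covers only the highest-scoring hypotheses, which is where the two approximations differ.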

“…In our setup each training utterance is combined with 20 different noises (room reverberations, background music, cafe noises) at a SNR ranging from 5dB to 25dB. We found that best results are obtained when noise is added when using CTC/Cross Entropy (CE) training criteria and original audio is used while doing EMBR training [11].…”

Section: Model Training (mentioning)

confidence: 99%

Speech Recognition for Medical Conversations

et al. 2018

In this paper we document our experiences with developing speech recognition for medical transcription, a system that automatically transcribes doctor-patient conversations. Towards this goal, we built a system along two different methodological lines: a Connectionist Temporal Classification (CTC) phoneme-based model and a Listen Attend and Spell (LAS) grapheme-based model. To train these models we used a corpus of anonymized conversations representing approximately 14,000 hours of speech. Because of noisy transcripts and alignments in the corpus, a significant amount of effort was invested in data cleaning issues. We describe a two-stage strategy we followed for segmenting the data. The data cleanup and development of a matched language model was essential to the success of the CTC-based models. The LAS-based models, however, were found to be resilient to alignment and transcript noise and did not require the use of language models. CTC models were able to achieve a word error rate of 20.1%, and the LAS models were able to achieve 18.3%. Our analysis shows that both models perform well on important medical utterances and therefore can be practical for transcribing medical conversations.

Developmental research on an interactive application for language speaking practice using speech recognition technology

Song 2021

Education Tech Research Dev
