2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461809

Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models

Abstract: Sequence-to-sequence models, such as attention-based models in automatic speech recognition (ASR), are typically trained to optimize the cross-entropy criterion, which corresponds to improving the log-likelihood of the data. However, system performance is usually measured in terms of word error rate (WER), not log-likelihood. Traditional ASR systems benefit from discriminative sequence training, which optimizes criteria such as the state-level minimum Bayes risk (sMBR) that are more closely related to WER. In the…
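Since the abstract is truncated here, a short sketch may help make the criterion concrete. The code below computes an expected-WER loss over an n-best list, which matches the kind of WER-oriented objective the abstract describes; the function name, the PyTorch framing, and the mean-error baseline follow the common formulation rather than the authors' exact recipe.

```python
import torch

def mwer_loss(nbest_log_probs: torch.Tensor,
              nbest_word_errors: torch.Tensor) -> torch.Tensor:
    """Expected number of word errors over an n-best list (a sketch).

    nbest_log_probs:   (N,) total model log-probabilities of the hypotheses.
    nbest_word_errors: (N,) word-level edit-distance errors of each
                       hypothesis against the reference transcript.
    """
    # Renormalize the model scores into a distribution over the n-best list.
    p_hat = torch.softmax(nbest_log_probs, dim=0)
    # Subtracting the mean error is a standard variance-reduction baseline.
    relative_errors = nbest_word_errors - nbest_word_errors.mean()
    return torch.sum(p_hat * relative_errors)

# Toy usage: three hypotheses with 0, 2, and 3 word errors.
scores = torch.tensor([-1.2, -2.3, -3.0], requires_grad=True)
errors = torch.tensor([0.0, 2.0, 3.0])
mwer_loss(scores, errors).backward()  # shifts mass toward low-error hypotheses
```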

Cited by 148 publications (119 citation statements). References 22 publications.
“…Next, we explore adding LAS rescoring (E6), where LAS is first trained with cross-entropy and then with MWER [30,10]. The RNN-T model is kept unchanged during LAS training.…”
Section: Second-Pass LAS Rescoring
Mentioning confidence: 99%
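A minimal sketch of the training setup this excerpt describes: the first-pass model is frozen while only the second-pass rescorer receives gradient updates. The module shapes and optimizer settings below are placeholders, not values from the cited work.

```python
import torch

def freeze(module: torch.nn.Module) -> None:
    """Disable gradients so the first-pass model stays unchanged."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

# Hypothetical stand-ins for the first-pass RNN-T and second-pass LAS rescorer.
rnnt_model = torch.nn.LSTM(input_size=80, hidden_size=640)
las_rescorer = torch.nn.Linear(640, 4096)

freeze(rnnt_model)
# Only the rescorer's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(las_rescorer.parameters(), lr=1e-4)
```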
“…We apply the MWER loss [22] in training which optimizes the expected word error rate by using n-best hypotheses:…”
Section: MWER Loss
Mentioning confidence: 99%
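The equation in this excerpt is cut off after the colon. What presumably followed is the standard n-best (beam) approximation of the expected word error rate from [22]; the notation below is a reconstruction, not a quotation.

```latex
% n-best approximation of the expected word error rate. \widehat{P}
% renormalizes the model scores over the beam, W(y_i, y^*) counts word
% errors, and \widehat{W} is their mean over the beam (a variance-reducing
% baseline).
\mathcal{L}_{\mathrm{MWER}}(x, y^{*}) \approx
  \sum_{y_i \in \mathrm{Beam}(x)}
    \widehat{P}(y_i \mid x)\,
    \bigl[\, W(y_i, y^{*}) - \widehat{W} \,\bigr],
\qquad
\widehat{P}(y_i \mid x) =
  \frac{P(y_i \mid x)}{\sum_{y_j \in \mathrm{Beam}(x)} P(y_j \mid x)}
```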
“…B is the beam size. In practice, we combine the MWER loss with cross-entropy (CE) loss to stabilize training: $\mathcal{L}'_{\mathrm{MWER}}(x, y^{*}) = \mathcal{L}_{\mathrm{MWER}}(x, y^{*}) + \alpha\,\mathcal{L}_{\mathrm{CE}}(x, y^{*})$, where $\alpha = 0.01$ as in [22].…”
Section: MWER Loss
Mentioning confidence: 99%
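A sketch of the interpolation quoted above, assuming a PyTorch setting; the decoder-logit shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

ALPHA = 0.01  # CE interpolation weight quoted in the excerpt above

def stabilized_mwer_loss(mwer: torch.Tensor,
                         dec_logits: torch.Tensor,
                         targets: torch.Tensor,
                         alpha: float = ALPHA) -> torch.Tensor:
    """Interpolate the MWER loss with cross-entropy to stabilize training.

    mwer:       scalar expected-WER loss over the n-best list.
    dec_logits: (B, U, V) decoder logits; targets: (B, U) token ids.
    """
    # cross_entropy expects class logits in dim 1, hence the transpose.
    ce = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    return mwer + alpha * ce
```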
“…First, we trained the MoChA models by using connectionist temporal classification (CTC) and cross-entropy (CE) losses jointly to learn alignment information precisely. A minimum word error rate (MWER) method, which is a type of sequence-discriminative training, was adopted to optimize the models [10]. Also, for better stability and convergence of model training, we applied a layer-wise pre-training mechanism [11].…”
Section: Introduction
Mentioning confidence: 99%
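The joint CTC and cross-entropy objective this excerpt mentions is conventionally a weighted sum of the two losses. Below is a sketch under assumed tensor shapes; the weight `lam` is a typical multi-task choice, not a value from the cited work.

```python
import torch
import torch.nn.functional as F

def joint_ctc_ce_loss(enc_log_probs: torch.Tensor,
                      dec_logits: torch.Tensor,
                      targets: torch.Tensor,
                      input_lengths: torch.Tensor,
                      target_lengths: torch.Tensor,
                      lam: float = 0.3) -> torch.Tensor:
    """Multi-task loss: CTC on the encoder plus CE on the attention decoder.

    enc_log_probs: (T, B, V) log-softmax encoder outputs for the CTC branch.
    dec_logits:    (B, U, V) decoder logits, one step per target token.
    targets:       (B, U) target token ids (id 0 reserved for the CTC blank).
    """
    ctc = F.ctc_loss(enc_log_probs, targets, input_lengths, target_lengths)
    ce = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    return lam * ctc + (1.0 - lam) * ce
```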