2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461809

Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models

Abstract: Sequence-to-sequence models, such as attention-based models in automatic speech recognition (ASR), are typically trained to optimize the cross-entropy criterion, which corresponds to improving the log-likelihood of the data. However, system performance is usually measured in terms of word error rate (WER), not log-likelihood. Traditional ASR systems benefit from discriminative sequence training, which optimizes criteria such as the state-level minimum Bayes risk (sMBR) that are more closely related to WER. In the…
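Since the abstract is truncated here, a short sketch may help make the criterion concrete. The code below computes an expected-WER loss over an n-best list, which matches the kind of WER-oriented objective the abstract describes; the function name, the PyTorch framing, and the mean-error baseline follow the common formulation rather than the authors' exact recipe.

```python
import torch

def mwer_loss(nbest_log_probs: torch.Tensor,
              nbest_word_errors: torch.Tensor) -> torch.Tensor:
    """Expected number of word errors over an n-best list (a sketch).

    nbest_log_probs:   (N,) total model log-probabilities of the hypotheses.
    nbest_word_errors: (N,) word-level edit-distance errors of each
                       hypothesis against the reference transcript.
    """
    # Renormalize the model scores into a distribution over the n-best list.
    p_hat = torch.softmax(nbest_log_probs, dim=0)
    # Subtracting the mean error is a standard variance-reduction baseline.
    relative_errors = nbest_word_errors - nbest_word_errors.mean()
    return torch.sum(p_hat * relative_errors)

# Toy usage: three hypotheses with 0, 2, and 3 word errors.
scores = torch.tensor([-1.2, -2.3, -3.0], requires_grad=True)
errors = torch.tensor([0.0, 2.0, 3.0])
mwer_loss(scores, errors).backward()  # shifts mass toward low-error hypotheses
```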

Cited by 148 publications (119 citation statements). References 22 publications.
“…Next, we explore adding LAS rescoring (E6), where LAS is first trained with cross-entropy and then with MWER [30,10]. The RNN-T model is kept unchanged during LAS training.…”
Section: Second-Pass LAS Rescoring
Mentioning confidence: 99%
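A minimal sketch of the training setup this excerpt describes: the first-pass model is frozen while only the second-pass rescorer receives gradient updates. The module shapes and optimizer settings below are placeholders, not values from the cited work.

```python
import torch

def freeze(module: torch.nn.Module) -> None:
    """Disable gradients so the first-pass model stays unchanged."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

# Hypothetical stand-ins for the first-pass RNN-T and second-pass LAS rescorer.
rnnt_model = torch.nn.LSTM(input_size=80, hidden_size=640)
las_rescorer = torch.nn.Linear(640, 4096)

freeze(rnnt_model)
# Only the rescorer's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(las_rescorer.parameters(), lr=1e-4)
```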
“…We apply the MWER loss [22] in training which optimizes the expected word error rate by using n-best hypotheses:…”
Section: MWER Loss
Mentioning confidence: 99%
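The equation in this excerpt is cut off after the colon. What presumably followed is the standard n-best (beam) approximation of the expected word error rate from [22]; the notation below is a reconstruction, not a quotation.

```latex
% n-best approximation of the expected word error rate. \widehat{P}
% renormalizes the model scores over the beam, W(y_i, y^*) counts word
% errors, and \widehat{W} is their mean over the beam (a variance-reducing
% baseline).
\mathcal{L}_{\mathrm{MWER}}(x, y^{*}) \approx
  \sum_{y_i \in \mathrm{Beam}(x)}
    \widehat{P}(y_i \mid x)\,
    \bigl[\, W(y_i, y^{*}) - \widehat{W} \,\bigr],
\qquad
\widehat{P}(y_i \mid x) =
  \frac{P(y_i \mid x)}{\sum_{y_j \in \mathrm{Beam}(x)} P(y_j \mid x)}
```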
“…B is the beam size. In practice, we combine the MWER loss with cross-entropy (CE) loss to stabilize training: $\mathcal{L}'_{\mathrm{MWER}}(x, y^{*}) = \mathcal{L}_{\mathrm{MWER}}(x, y^{*}) + \alpha\,\mathcal{L}_{\mathrm{CE}}(x, y^{*})$, where $\alpha = 0.01$ as in [22].…”
Section: MWER Loss
Mentioning confidence: 99%
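A sketch of the interpolation quoted above, assuming a PyTorch setting; the decoder-logit shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

ALPHA = 0.01  # CE interpolation weight quoted in the excerpt above

def stabilized_mwer_loss(mwer: torch.Tensor,
                         dec_logits: torch.Tensor,
                         targets: torch.Tensor,
                         alpha: float = ALPHA) -> torch.Tensor:
    """Interpolate the MWER loss with cross-entropy to stabilize training.

    mwer:       scalar expected-WER loss over the n-best list.
    dec_logits: (B, U, V) decoder logits; targets: (B, U) token ids.
    """
    # cross_entropy expects class logits in dim 1, hence the transpose.
    ce = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    return mwer + alpha * ce
```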
“…First, we trained the MoChA models by using connectionist temporal classification (CTC) and cross-entropy (CE) losses jointly to learn alignment information precisely. A minimum word error rate (MWER) method, which is a type of sequence-discriminative training, was adopted to optimize the models [10]. Also, for better stability and convergence of model training, we applied a layer-wise pre-training mechanism [11].…”
Section: Introduction
Mentioning confidence: 99%
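The joint CTC and cross-entropy objective this excerpt mentions is conventionally a weighted sum of the two losses. Below is a sketch under assumed tensor shapes; the weight `lam` is a typical multi-task choice, not a value from the cited work.

```python
import torch
import torch.nn.functional as F

def joint_ctc_ce_loss(enc_log_probs: torch.Tensor,
                      dec_logits: torch.Tensor,
                      targets: torch.Tensor,
                      input_lengths: torch.Tensor,
                      target_lengths: torch.Tensor,
                      lam: float = 0.3) -> torch.Tensor:
    """Multi-task loss: CTC on the encoder plus CE on the attention decoder.

    enc_log_probs: (T, B, V) log-softmax encoder outputs for the CTC branch.
    dec_logits:    (B, U, V) decoder logits, one step per target token.
    targets:       (B, U) target token ids (id 0 reserved for the CTC blank).
    """
    ctc = F.ctc_loss(enc_log_probs, targets, input_lengths, target_lengths)
    ce = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    return lam * ctc + (1.0 - lam) * ce
```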