Interspeech 2017
DOI: 10.21437/interspeech.2017-639

Optimizing Expected Word Error Rate via Sampling for Speech Recognition

Abstract: State-level minimum Bayes risk (sMBR) training has become the de facto standard for sequence-level training of speech recognition acoustic models. It has an elegant formulation using the expectation semiring, and gives large improvements in word error rate (WER) over models trained solely using cross-entropy (CE) or connectionist temporal classification (CTC). sMBR training optimizes the expected number of frames at which the reference and hypothesized acoustic states differ. It may be preferable to optimize th…
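In symbols, the objective the abstract is building toward is the expected number of word errors under the model, estimated by sampling (notation is mine, not the paper's; $W(y, y^*)$ counts word errors of hypothesis $y$ against reference $y^*$):

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{y \sim p_\theta(y \mid x)}\big[W(y, y^*)\big] \;\approx\; \frac{1}{N}\sum_{i=1}^{N} W(y_i, y^*), \qquad y_i \sim p_\theta(\cdot \mid x),
$$

whose gradient follows from the score-function identity $\nabla_\theta \mathcal{L} = \mathbb{E}_y\big[(W(y, y^*) - b)\,\nabla_\theta \log p_\theta(y \mid x)\big]$ for any constant baseline $b$, since $\mathbb{E}_y[\nabla_\theta \log p_\theta(y \mid x)] = 0$. The citation statements below refer to exactly this sampled approximation.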

Cited by 42 publications (36 citation statements)
References 25 publications (37 reference statements)
“…Computing the loss in (4) exactly is intractable since it involves a summation over all possible label sequences. We therefore consider two possible approximations which ensure tractability: approximating the expectation in (4) with samples [3,15], or restricting the summation to an N-best list, as is commonly done during sequence training for ASR [19].…”
Section: Minimum Word Error Rate Training of Attention-Based Models
Mentioning confidence: 99%
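The second approximation mentioned above, restricting the summation to an N-best list, can be sketched as follows (hypothetical helpers of my own, not code from either paper; hypothesis scores are renormalized over the list so the weights sum to 1):

```python
import math

def word_errors(ref, hyp):
    """Levenshtein (edit) distance between two word sequences."""
    d = list(range(len(hyp) + 1))          # d[j] = distance(ref[:i], hyp[:j])
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                            # deletion
                       d[j - 1] + 1,                        # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return d[-1]

def expected_word_errors(nbest, ref):
    """E[word errors] over an N-best list of (hypothesis_words, log_prob)
    pairs, with log-probs renormalized over the list."""
    m = max(lp for _, lp in nbest)
    log_z = m + math.log(sum(math.exp(lp - m) for _, lp in nbest))
    return sum(math.exp(lp - log_z) * word_errors(ref, hyp)
               for hyp, lp in nbest)

# Example: a 3-best list scored against the reference transcript.
nbest = [("the cat sat".split(), -0.4),
         ("the cat sad".split(), -1.6),
         ("a cat sad".split(), -2.9)]
print(expected_word_errors(nbest, "the cat sat".split()))
```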
“…We can approximate the expectation in (4) using an empirical average over samples drawn from the model [15]:…”
Section: Approximation by Sampling
Mentioning confidence: 99%
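For intuition, a toy version of this sampled approximation might look like the following (my own sketch, with a categorical distribution standing in for the ASR decoder, and reusing the word_errors() helper from the N-best sketch above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an ASR model: a categorical distribution over a few
# fixed hypotheses, parameterized by logits. A real system would sample
# label sequences from the decoder instead. word_errors() is the
# Levenshtein helper defined in the N-best sketch above.
hyps = [h.split() for h in
        ["the cat sat", "the cat sad", "a cat sat", "the bat sat"]]
ref = "the cat sat".split()
logits = np.zeros(len(hyps))

def sampled_mwer_grad(logits, n_samples=8):
    """Score-function (REINFORCE) estimate of d E[word errors] / d logits.

    Subtracting the mean sampled error count as a baseline reduces the
    variance of the estimate without biasing it.
    """
    p = np.exp(logits - logits.max())
    p /= p.sum()
    idx = rng.choice(len(hyps), size=n_samples, p=p)
    errs = np.array([word_errors(ref, hyps[i]) for i in idx], dtype=float)
    adv = errs - errs.mean()                    # baseline-subtracted errors
    grad = np.zeros_like(logits)
    for i, a in zip(idx, adv):
        grad += a * (np.eye(len(hyps))[i] - p)  # a * d log p_i / d logits
    return grad / n_samples

for _ in range(200):                            # plain SGD on the logits
    logits -= 0.5 * sampled_mwer_grad(logits)
print(hyps[int(np.argmax(logits))])             # typically the reference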
“…In our setup, each training utterance is combined with 20 different noises (room reverberation, background music, cafe noise) at SNRs ranging from 5 dB to 25 dB. We found that the best results are obtained when noise is added during CTC/cross-entropy (CE) training and the original audio is used during EMBR training [11].…”
Section: Model Training
Mentioning confidence: 99%
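As an aside on the augmentation step, scaling noise to hit a target SNR is typically done along these lines (a generic sketch, not the cited paper's pipeline):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at a target signal-to-noise ratio in dB.

    speech, noise: 1-D float arrays of samples; the noise is tiled or
    cropped to the speech length before mixing.
    """
    reps = -(-len(speech) // len(noise))         # ceiling division
    noise = np.tile(noise, reps)[: len(speech)]

    # Scale the noise so that 10 * log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12        # guard against silence
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: mix at an SNR drawn uniformly from [5, 25] dB, as in the setup above.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)              # stand-in for 1 s @ 16 kHz
noisy = mix_at_snr(speech, rng.standard_normal(8000), rng.uniform(5.0, 25.0))
```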