2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2011.5947455
Learning non-parametric models of pronunciation

Abstract: As more data becomes available for a given speech recognition task, the natural way to improve recognition accuracy is to train larger models. But, while this strategy yields modest improvements to small systems, the relative gains diminish as the data and models grow. In this paper, we demonstrate that abundant data allows us to model patterns and structure that are unaccounted for in standard systems. In particular, we model the systematic mismatch between the canonical pronunciations of words and the actual…

Cited by 5 publications (8 citation statements); references 18 publications.
“…Our approach is similar, in spirit, to previous work on learning non-parametric pronunciation models [6], learning context-dependent string edit distances [7] and phone-to-word transduction [8]. The novelty of our approach, distinguishing it from previous methods [6, 7], lies in the fact that we do not treat q_w and q_x simply as sequences of phonetic labels: we explicitly exploit timing information associated with phone boundaries in these sequences, which may be easily obtained as part of the forced-alignment or phonetic decoding process.…”
Section: Introduction
confidence: 90%
“…The novelty of our approach, distinguishing it from previous methods [6, 7], lies in the fact that we do not treat q_w and q_x simply as sequences of phonetic labels: we explicitly exploit timing information associated with phone boundaries in these sequences, which may be easily obtained as part of the forced-alignment or phonetic decoding process. In addition, the proposed approach naturally incorporates long-span context, since it is defined in terms of contiguous sequences of phonetic labels (which we term chunks), similar to the multigram model of Deligne et al. [9].…”
Section: Introduction
confidence: 99%
“…One approach, which was studied heavily especially in the 1990s but also more recently, is to start with a dictionary containing canonical pronunciations and add to it those alternative pronunciations that occur often in some database, or that are generated by deterministic or probabilistic phonetic substitution, insertion, and deletion rules (e.g., Sloboda and Waibel, 1996; Riley et al., 1999; Weintraub et al., 1996b; Strik and Cucchiarini, 1999; Fosler-Lussier, 1999; Saraçlar and Khudanpur, 2004; Hazen et al., 2005). Other approaches are based on alternative models of transformations between the canonical and observed pronunciations, such as phonetic edit distance models (Hutchinson and Droppo, 2011) and log-linear models with features based on canonical-observed phone string combinations (Zweig and Nguyen, 2009). Efforts to use such ideas in ASR systems have produced performance gains, but not of sufficient magnitude to solve the pronunciation variation problem.…”
Section: ASR
confidence: 97%
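The phonetic edit distance models mentioned in the statement above build on aligning a canonical phone string with an observed one. A minimal sketch of that underlying computation is a standard Levenshtein distance over phone symbols with unit costs (the cited models learn weighted, context-dependent costs instead; the example pronunciations below are illustrative, not taken from the paper):

```python
def phone_edit_distance(canonical, observed):
    """Levenshtein distance between two phone sequences (unit costs)."""
    m, n = len(canonical), len(observed)
    # dp[i][j] = cost of aligning canonical[:i] with observed[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # i deletions
    for j in range(n + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if canonical[i - 1] == observed[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a canonical phone
                           dp[i][j - 1] + 1,        # insert an observed phone
                           dp[i - 1][j - 1] + sub)  # match / substitution
    return dp[m][n]

# Canonical vs. a reduced pronunciation of "probably" ("prob'ly"),
# written as hypothetical ARPAbet-style phone strings
canonical = ["p", "r", "aa", "b", "ah", "b", "l", "iy"]
observed = ["p", "r", "aa", "b", "l", "iy"]
print(phone_edit_distance(canonical, observed))  # 2 (two phones deleted)
```

Learned variants replace the unit costs with substitution, insertion, and deletion probabilities estimated from aligned canonical/observed phone pairs.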
“…For example, phonetic dictionary expansion may affect different systems differently (e.g., possibly achieving greater improvements in a segment-based recognizer [33] than in HMM-based recognizers [30], [10]), but to our knowledge there have been no direct comparisons on identical tasks and data sets. We have also only briefly touched on automatic sub-word unit learning and the related task of automatic dictionary learning [39], [40], [47].…”
Section: Discussion
confidence: 99%
“…Dictionaries are typically manually generated, but can also be generated in a data-driven way [39], [40].…”
Section: B. Impact of Phonetic Dictionary Expansion
confidence: 99%