2004
DOI: 10.1016/j.specom.2004.01.006

Prosodic and other cues to speech recognition failures

Abstract: In spoken dialogue systems, it is important for the system to know how likely a speech recognition hypothesis is to be correct, so it can reject misrecognized user turns, or, in cases where many errors have occurred, change its interaction strategy or switch the caller to a human attendant. We have identified prosodic features which predict more accurately when a recognition hypothesis contains errors than the acoustic confidence scores traditionally used in automatic speech recognition in spoken dialogue systems…
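The abstract describes using prosodic features of a user turn to flag misrecognized hypotheses more reliably than acoustic confidence scores alone. Below is a minimal sketch of that idea; the feature set, the toy values and labels, and the logistic-regression model are illustrative assumptions, not the paper's actual features, data, or learner.

```python
# Sketch: flag likely-misrecognized turns from per-turn prosodic features
# plus the recognizer's confidence score. All values are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-turn features:
# [duration (s), mean F0 (Hz), F0 range (Hz), mean energy (dB), ASR confidence]
X = np.array([
    [1.1, 178.0,  55.0, 61.0, 0.92],
    [2.9, 235.0, 120.0, 71.0, 0.48],
    [0.8, 172.0,  40.0, 59.0, 0.90],
    [3.2, 242.0, 135.0, 74.0, 0.37],
    [1.4, 181.0,  62.0, 63.0, 0.85],
    [2.6, 228.0, 108.0, 69.0, 0.52],
])
# 1 = hypothesis contained errors, 0 = hypothesis was correct.
y = np.array([0, 1, 0, 1, 0, 1])

clf = LogisticRegression().fit(X, y)

new_turn = np.array([[2.7, 230.0, 115.0, 70.0, 0.50]])
print(clf.predict(new_turn))        # e.g. [1] -> reject or re-prompt the user
print(clf.predict_proba(new_turn))  # class probabilities, usable for thresholding
```

In a deployed dialogue system, the predicted error probability would feed the rejection or strategy-switching decisions the abstract mentions.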

Cited by 67 publications (50 citation statements)
References 31 publications (14 reference statements)

“…It turned out that 374 out of 1183 speaker turns were misunderstood by the system (32%). These figures are representative of speaker-independent spoken dialogue systems in real-life settings (e.g., Hirschberg et al., 2004; Carpenter et al., 2001; Nakano and Hazen, 2003; Walker et al., 1998).…”
Section: Data Collection
confidence: 99%
“…In addition, human speakers also respond in a different vocal style to problematic system prompts than to unproblematic ones: when speech recognition errors occur, they tend to correct them in a hyperarticulate manner (which may be characterized as longer, louder and higher). This generally leads to worse recognition results ('spiral errors'), since standard speech recognizers are trained on normal, non-hyperarticulated speech (Oviatt et al., 1998; Levow, 2002; Hirschberg et al., 2004), although more recent studies suggest that systems are becoming less vulnerable to hyperarticulation (Goldberg et al., 2003). In a similar vein, when speakers respond to a problematic yes-no question, their denials ("no") share many of the properties typical of hyperarticulate speech, in that they are longer, louder and higher than unproblematic negations.…”
Section: Introduction
confidence: 99%
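The "longer, louder and higher" characterization of hyperarticulated corrections suggests a simple heuristic check against a speaker's earlier turns. The sketch below is an assumption-laden illustration of that heuristic; the thresholds and feature values are arbitrary and not taken from any of the cited studies.

```python
# Sketch: flag a turn as possibly hyperarticulated if it is noticeably
# longer, louder, and higher-pitched than the speaker's previous turns.
# The thresholds (1.3x duration, +3 dB energy, +15% mean F0) are arbitrary.
from statistics import mean

def looks_hyperarticulated(turn, history):
    """turn/history items: dicts with duration_s, energy_db, mean_f0_hz."""
    if not history:
        return False
    base_dur = mean(t["duration_s"] for t in history)
    base_nrg = mean(t["energy_db"] for t in history)
    base_f0 = mean(t["mean_f0_hz"] for t in history)
    return (turn["duration_s"] > 1.3 * base_dur
            and turn["energy_db"] > base_nrg + 3.0
            and turn["mean_f0_hz"] > 1.15 * base_f0)

history = [
    {"duration_s": 1.0, "energy_db": 60.0, "mean_f0_hz": 175.0},
    {"duration_s": 1.2, "energy_db": 62.0, "mean_f0_hz": 180.0},
]
correction = {"duration_s": 1.9, "energy_db": 67.0, "mean_f0_hz": 210.0}
print(looks_hyperarticulated(correction, history))  # True
```

A dialogue system detecting this pattern could adapt its prompt or recognition strategy rather than letting spiral errors accumulate.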
“…It is a style adaptively employed by speakers to enhance recognition and comprehension by the listener given the listening environment at hand (Lindblom, 1990; Moon & Lindblom, 1994; Lindblom, 1996; Jurafsky et al., 2001; Aylett & Turk, 2004). It also occurs frequently in human-computer interaction when automatic speech recognizers produce errors: speakers shift to a hyperarticulated speaking style which, while helpful in human speech communication, can actually limit the success of human-computer interaction, given that automatic speech recognizers are not commonly trained on this style (Hirschberg, Litman, & Swerts, 2004).…”
Section: Introduction
confidence: 99%
“…As a result, lowering word error rate is a focus of SR research, which can benefit from analyzing SR errors. SR errors have been examined from various perspectives: the linguistic regularity of errors (McKoskey and Boley, 2000), the relationships between linguistic factors and SR performance (Greenberg and Chang, 2000), and the associations of prosodic features with SR errors (Hirschberg et al., 2004). However, little is understood about patterns of errors with regard to ease of detection.…”
Section: Introduction
confidence: 99%