Modeling ASR Ambiguity for Neural Dialogue State Tracking

Pal, Vaishali; Guillot, Fabien; Shrivastava, Manish; Renders, Jean-Michel; Besacier, Laurent

doi:10.21437/interspeech.2020-1783

Cited by 5 publications

(3 citation statements)

References 16 publications

“…[19] shows a two-step approach to generate the best path from confusion network to improve slot filling task. A more recent work [20] studies the approach of using the confusion network in a neural dialogue state tracker (DST), where the authors propose an attentional confusion network encoder that can be used in any DST. On the other hand, there has been work on re-scoring ASR nbest by exploring the morphological, lexical, and syntactic features [21,22].…”

Section: Effect Of Multiple Cnns (mentioning)

confidence: 99%

ASR N-Best Fusion Nets

Chen, Wanigasekara, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Current spoken language understanding systems heavily rely on the best hypothesis (ASR 1-best) generated by automatic speech recognition, which is used as the input for downstream models such as natural language understanding (NLU) modules. However, the potential errors and misrecognition in ASR 1-best raise challenges to NLU. It is usually difficult for NLU models to recover from ASR errors without additional signals, which leads to suboptimal SLU performance. This paper proposes a fusion network to jointly consider ASR n-best hypotheses for enhanced robustness to ASR errors. Our experiments on Alexa data show that our model achieved 21.71% error reduction compared to baseline trained on transcription for domain classification.

“…Word lattices from ASR were first used by [1] over ASR top-1 hypothesis for tasks such as named-entity extraction and call classification. Word confusion networks have been recently used by [4] for intent classification in dialogue systems and by [2,10] for dialogue state tracking (DST). [2] show that confusion network gives comparable performance to top-N hypotheses of ASR while [10] show that using confusion network improves performance in both in time and accuracy.…”

Section: Related Work (mentioning)

confidence: 99%

“…Word confusion networks have been recently used by [4] for intent classification in dialogue systems and by [2,10] for dialogue state tracking (DST). [2] show that confusion network gives comparable performance to top-N hypotheses of ASR while [10] show that using confusion network improves performance in both in time and accuracy. Another related task in SLU is that of Spoken Question Answering.…”

Section: Related Work (mentioning)

confidence: 99%

ConfNet2Seq: Full Length Answer Generation from Spoken Questions

Pal, Shrivastava, Besacier 2020

Preprint

Self Cite

Conversational and task-oriented dialogue systems aim to interact with the user using natural responses through multi-modal interfaces, such as text or speech. These desired responses are in the form of full-length natural answers generated over facts retrieved from a knowledge source. While the task of generating natural answers to questions from an answer span has been widely studied, there has been little research on natural sentence generation over spoken content. We propose a novel system to generate full length natural language answers from spoken questions and factoid answers. The spoken sequence is compactly represented as a confusion network extracted from a pre-trained Automatic Speech Recognizer. This is the first attempt towards generating full-length natural answers from a graph input (confusion network) to the best of our knowledge. We release a large-scale dataset of 259,788 samples of spoken questions, their factoid answers and corresponding full-length textual answers. Following our proposed approach, we achieve comparable performance with best ASR hypothesis.
