2022 IEEE Spoken Language Technology Workshop (SLT), 2023
DOI: 10.1109/slt54892.2023.10022703

STOP: A Dataset for Spoken Task Oriented Semantic Parsing

Abstract: End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model. It promises to improve the performance of assistant systems by leveraging acoustic information lost in the intermediate textual representation and preventing cascading errors from Automatic Speech Recognition (ASR). Further, having one unified model has efficiency advantages when deploying assistant systems on-device. However, the limited number of public audio datasets with semantic parse labels hinders th…
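The cascade the abstract alludes to runs ASR first and feeds its transcript to a text-based semantic parser, so any recognition error propagates; an end-to-end model maps audio to the parse in a single network. The sketch below (PyTorch) illustrates that single-model idea only; the module choices, feature dimensions, and vocabulary size are illustrative assumptions, not the architecture from the STOP paper.

# Minimal sketch of an end-to-end SLU model: audio features in, semantic-parse
# tokens out, with no intermediate ASR transcript whose errors could cascade.
# All sizes and module choices are assumptions for illustration.
import torch
import torch.nn as nn

class End2EndSLU(nn.Module):
    def __init__(self, n_mels=80, d_model=256, parse_vocab=1000):
        super().__init__()
        self.subsample = nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=4)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.embed = nn.Embedding(parse_vocab, d_model)
        self.out = nn.Linear(d_model, parse_vocab)

    def forward(self, mels, parse_tokens):
        # mels: (batch, n_mels, time); parse_tokens: (batch, target_len)
        x = self.subsample(mels).transpose(1, 2)   # (batch, time', d_model)
        memory = self.encoder(x)                   # acoustic representation
        tgt = self.embed(parse_tokens)             # parse-token embeddings
        hidden = self.decoder(tgt, memory)         # (training would add a causal mask)
        return self.out(hidden)                    # logits over the parse vocabulary

model = End2EndSLU()
logits = model(torch.randn(2, 80, 200), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])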

Cited by 6 publications (3 citation statements)
References 15 publications
“…We present two novel techniques to improve E2E SLU models: 1) a method to encode ASR hypothesis quality and 2) an effective method to integrate these quality information into E2E SLU models. We show accuracy improvements on STOP dataset [16] in the on-device streaming scenario and share the analysis to demonstrate the effectiveness of our approach.…”
Section: Introduction (mentioning; confidence: 73%)
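The statement above describes the cited approach only at a high level. Purely as a generic illustration of how an utterance-level ASR quality signal could be fused with hypothesis token embeddings (this is not the cited paper's actual method; every name and dimension below is an assumption), one minimal PyTorch pattern is:

# Generic quality-aware fusion sketch: embed a scalar ASR confidence score and
# concatenate it with hypothesis token embeddings before downstream parsing.
import torch
import torch.nn as nn

class QualityAwareFusion(nn.Module):
    def __init__(self, d_model=256, vocab=8000):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab, d_model)
        self.quality_proj = nn.Linear(1, d_model)   # embed a scalar confidence
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, hyp_tokens, confidence):
        # hyp_tokens: (batch, len); confidence: (batch, 1) utterance-level score
        tok = self.tok_embed(hyp_tokens)                   # (batch, len, d_model)
        qual = self.quality_proj(confidence).unsqueeze(1)  # (batch, 1, d_model)
        qual = qual.expand(-1, tok.size(1), -1)            # broadcast over tokens
        return self.fuse(torch.cat([tok, qual], dim=-1))   # fused features

fusion = QualityAwareFusion()
feats = fusion(torch.randint(0, 8000, (2, 10)), torch.rand(2, 1))
print(feats.shape)  # torch.Size([2, 10, 256])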
“…We used the largest public SLU dataset, STOP (Spoken Task Oriented Semantic Parsing) [16] to evaluate our proposed approach. The STOP dataset is based on Task-Oriented Semantic Parsing (TOPv2) [25], a well-known NLU benchmark, that covers 8 different domains including alarm, messaging, music, navigation, timer, weather, reminder, and event.…”
Section: STOP Dataset (mentioning; confidence: 99%)
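For readers unfamiliar with TOPv2, its parses are nested bracketed trees of intents (IN:) and slots (SL:), which STOP pairs with audio across the eight domains listed above. The record below is illustrative only; the intent and slot names are plausible TOPv2-style labels, not verified entries from the dataset.

# Hypothetical STOP/TOPv2-style record (illustrative values, not from the dataset)
example = {
    "domain": "weather",  # one of the 8 domains listed in the citation above
    "utterance": "what is the weather in Boston tomorrow",
    "semantic_parse": "[IN:GET_WEATHER [SL:LOCATION Boston ] [SL:DATE_TIME tomorrow ] ]",
}
print(example["semantic_parse"])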
“…Additionally, audio datasets collected by the authors themselves in real environments or situations were observed. Examples include the Chime-Home [7], a dataset of gunshot audio [8], one focused on motor sounds [9], and some specific resources for spoken tasks, such as AudioMNIST [10] and STOP [11]. Also, there are audio datasets created through cutting, modifications, and transformations applied to existing datasets, such as SARdB [12] for audio scenes and Shrutilipi [13] for automatic speech recognition.…”
Section: Introduction (mentioning; confidence: 99%)