Semantic Complexity in End-to-End Spoken Language Understanding

McKenna, Joseph P.; Choudhary, Samridhi; Saxon, Michael; Strimel, Grant P.; Mouchtaris, Athanasios

doi:10.21437/interspeech.2020-2929

Cited by 9 publications

(12 citation statements)

References 15 publications

“…We use two E2E SLU datasets for our experiments: (1) the publicly available Fluent Speech Commands (FSC) and (2) an internal SLU dataset. Additionally, we create a "hard test set" to assess model performance in the most demanding scenarios in generalized VA. We use the average n-gram entropy and Minimum Spanning Tree (MST) complexity score as described in [27] to quantify their levels of semantic complexity. Fluent Speech Commands: FSC [21] is an SLU dataset containing 30,043 utterances with a vocabulary of 124 words and 248 unique utterances over 31 intents in home appliance and smart speaker control.…”

Section: Data (mentioning)

confidence: 99%

“…The SLU task on this dataset is just the intent classification task. It has an average n-gram entropy of 6.9 bits and an average MST complexity score of 0.2 [27].…”

Section: Data (mentioning)

confidence: 99%

“…Over the last year the state-of-the-art on FSC has progressed to over 99% test set accuracy for several E2E approaches [14][15][16][17][18][19][20]. However, there remains a gap between the capabilities demonstrated thus far and the E2E SLU requirements for a generalized VA [27]. In particular, existing benchmarks focus on tasks with limited semantic complexity and output structural diversity.…”

Section: Introduction (mentioning)

confidence: 99%

“…This leads to tasks with a long tail of rare utterances containing unique n-grams and specific slot values unseen during training, that are more semantically complex than the tasks tackled in aforementioned benchmark SLU datasets. Differences in semantic complexity across datasets can be assessed using n-gram entropy and utterance embedding MST complexity measures [27]. Furthermore, in generalized VA tasks the output label space is countably infinite, as any arbitrary sequence of words could be a valid slot output.…”

Section: Introduction (mentioning)

confidence: 99%

End-to-End Spoken Language Understanding for Generalized Voice Assistants

Saxon, Choudhary, McKenna, et al. 2021

Interspeech 2021

Self Cite

End-to-end (E2E) spoken language understanding (SLU) systems predict utterance semantics directly from speech using a single model. Previous work in this area has focused on targeted tasks in fixed domains, where the output semantic structure is assumed a priori and the input speech is of limited complexity. In this work we present our approach to developing an E2E model for generalized SLU in commercial voice assistants (VAs). We propose a fully differentiable, transformer-based, hierarchical system that can be pretrained at both the ASR and NLU levels. This is then fine-tuned on both transcription and semantic classification losses to handle a diverse set of intent and argument combinations. This leads to an SLU system that achieves significant improvements over baselines on a complex internal generalized VA dataset with a 43% improvement in accuracy, while still meeting the 99% accuracy benchmark on the popular Fluent Speech Commands dataset. We further evaluate our model on a hard test set, exclusively containing slot arguments unseen in training, and demonstrate a nearly 20% improvement, showing the efficacy of our approach in truly demanding VA scenarios.

“…A Conversational AI is composed by end-to-end spoken language understanding (SLU) models to predict semantics directly from speech [3]. The conventional approach to SLU uses two distinct components to sequentially process a spoken utterance: an automatic speech recognition (ASR) model that transcribes the speech to a text transcript, followed by a natural language understanding (NLU) model that predicts the domain, intent, and entities given the transcript.…”

Section: Introduction (mentioning)

confidence: 99%

Could a Conversational AI Identify Offensive Language?

et al. 2021

In recent years, we have seen a wide use of Artificial Intelligence (AI) applications in the Internet and everywhere. Natural Language Processing and Machine Learning are important sub-fields of AI that have made Chatbots and Conversational AI applications possible. Those algorithms are built based on historical data in order to create language models, however historical data could be intrinsically discriminatory. This article investigates whether a Conversational AI could identify offensive language and it will show how large language models often produce quite a bit of unethical behavior because of bias in the historical data. Our low-level proof-of-concept will present the challenges to detect offensive language in social media and it will discuss some steps to propitiate strong results in the detection of offensive language and unethical behavior using a Conversational AI.
