2019
DOI: 10.48550/arxiv.1904.03670
Preprint

Speech Model Pre-training for End-to-End Spoken Language Understanding

Cited by 39 publications (134 citation statements)
References 0 publications
“…DeepSpeech is a character-level model whose softmax outputs over the model vocabulary were used as inputs to the intent classification model [3]. Similarly, softmax outputs of an English phoneme recognition system [4] have also been used to build intent recognition systems for Sinhala and Tamil [5].…”
Section: Related Work
confidence: 99%
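The pipeline described in this citation — per-frame softmax posteriors from an ASR model fed into a separate intent classifier — can be sketched minimally as follows. This is a hypothetical illustration, not the cited systems' actual code: the vocabulary size, intent count, and mean-pooling-plus-linear classifier are all stand-in assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, VOCAB_SIZE, NUM_INTENTS = 100, 40, 31  # illustrative sizes

# Per-frame posteriors as an ASR model would emit: (time, vocab).
# Here random numbers stand in for real acoustic-model outputs.
posteriors = softmax(rng.normal(size=(T, VOCAB_SIZE)))

# A toy intent classifier on top: pool posteriors over time,
# then apply a linear layer followed by a softmax.
W = rng.normal(size=(VOCAB_SIZE, NUM_INTENTS)) * 0.1
features = posteriors.mean(axis=0)       # (vocab,)
intent_scores = softmax(features @ W)    # (num_intents,)

print(intent_scores.shape)  # (31,)
```

The key property of such cascades is that the intent classifier never sees the raw audio, only the ASR model's posterior distribution over its output vocabulary.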
“…The complete statistics are shown in Table 1. For English, we use the largest freely available Fluent Speech Commands (FSC) dataset [4]. The dataset has 248 unique sentences spoken by 97 speakers.…”
Section: Dataset
confidence: 99%
“…Many benchmark datasets are created to facilitate Spoken Language Understanding (SLU) [41,42,5,43,26,44], which evaluate the robustness of the downstream NLU model against the error output from the upstream acoustic model [7,8,9,10]. However, they are only designed for a particular domain or a specific task such as intent detection and slot filling.…”
Section: Related Work
confidence: 99%
“…Additionally, since these two models are trained independently, the primary metric of interest (intent classification accuracy) cannot be directly optimized. Due to this problem, end-to-end (E2E) SLU models that directly map a speech signal input to an SLU output have become popular [5]- [10].…”
Section: Introduction
confidence: 99%
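The end-to-end mapping described above — one network from the speech signal straight to the SLU output, so the intent objective can be optimized directly — can be sketched with a toy model. Everything here is an illustrative assumption (random weights in place of a trained network, MFCC-like input features, mean pooling):

```python
import numpy as np

rng = np.random.default_rng(2)
T, FEAT, HIDDEN, NUM_INTENTS = 100, 13, 32, 31  # illustrative sizes

# Two small random projections stand in for a trained E2E network.
W1 = rng.normal(size=(FEAT, HIDDEN)) * 0.1
W2 = rng.normal(size=(HIDDEN, NUM_INTENTS)) * 0.1

def e2e_intent_logits(mfccs):
    """Map acoustic features (time, feat) straight to intent logits,
    with no intermediate transcript or separate ASR stage."""
    h = np.tanh(mfccs @ W1).mean(axis=0)  # pooled utterance embedding
    return h @ W2                          # (num_intents,)

logits = e2e_intent_logits(rng.normal(size=(T, FEAT)))
print(logits.shape)  # (31,)
```

Because there is no transcript bottleneck, a cross-entropy loss on these logits backpropagates through the whole model, which is what lets E2E SLU optimize intent accuracy directly.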
“…This can be addressed by using pre-training to reduce the amount of training data required. For example, researchers have pre-trained models on large ASR datasets such as LibriSpeech [10], [11] to relax audio data requirements, and have used pre-trained BERT networks [12]–[15] to relax text data requirements.…”
Section: Introduction
confidence: 99%
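The pre-train/fine-tune recipe this citation refers to — train an encoder on a large ASR corpus, then reuse it (frozen or lightly tuned) while a small intent head is trained on limited SLU data — can be sketched as follows. This is a hypothetical toy, not the paper's method: a fixed random projection stands in for the "pre-trained" encoder, and only the linear intent head is updated.

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_INTENTS = 31  # illustrative

# Stand-in "pre-trained" encoder: a fixed projection of acoustic
# features (e.g. 13 MFCCs -> 64-dim embedding), kept frozen.
W_enc = rng.normal(size=(13, 64)) * 0.1

def encode(mfccs):
    # mfccs: (time, 13); frozen encoder, mean-pooled over time
    return np.tanh(mfccs @ W_enc).mean(axis=0)

# Fine-tuning stage: only the intent head W_head is trainable.
W_head = np.zeros((64, NUM_INTENTS))

def train_step(mfccs, intent_id, lr=0.1):
    """One cross-entropy gradient step on the head; encoder untouched."""
    global W_head
    h = encode(mfccs)
    logits = h @ W_head
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = np.outer(h, probs)      # dL/dW for all classes ...
    grad[:, intent_id] -= h        # ... minus the one-hot target term
    W_head -= lr * grad

x = rng.normal(size=(50, 13))      # one fake utterance
before = (encode(x) @ W_head)[3]   # logit of class 3 before tuning
for _ in range(20):
    train_step(x, intent_id=3)
after = (encode(x) @ W_head)[3]

print(after > before)  # True: the head learns while the encoder is frozen
```

Freezing the encoder is the simplest variant; in practice the pre-trained layers are often unfrozen gradually or fine-tuned with a lower learning rate.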