Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system

Ye, Qian; Ubale, Rutuja; Ramanarayanan, Vikram; Lange, Patrick; Suendermann-Oeft, David; Evanini, Keelan; Tsuprun, Eugene

doi:10.1109/asru.2017.8268987

Cited by 65 publications

(56 citation statements)

References 28 publications

“…Work in progress. end-to-end architectures capable of learning how to map sequences of acoustic features directly to SLU recognition units [5,6,7,8]. SLU units that are typically used are combinations of ASR-level units (e.g.…”

Section: Introduction

mentioning

confidence: 99%

End-to-End Architectures for ASR-Free Spoken Language Understanding

Palogiannidi, Gkinis, Mastrapas et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Spoken Language Understanding (SLU) is the problem of extracting meaning from speech utterances. It is typically addressed as a two-step problem, where an Automatic Speech Recognition (ASR) model is employed to convert speech into text, followed by a Natural Language Understanding (NLU) model to extract meaning from the decoded text. Recently, end-to-end approaches have emerged, aiming at unifying ASR and NLU into a single deep neural SLU architecture, trained using combinations of ASR- and NLU-level recognition units. In this paper, we explore a set of recurrent architectures for intent classification, tailored to the recently introduced Fluent Speech Commands (FSC) dataset, where intents are formed as combinations of three slots (action, object, and location). We show that by combining deep recurrent architectures with standard data augmentation, state-of-the-art results can be attained without using ASR-level targets or pretrained ASR models. We also investigate the model's generalizability to new wordings, and we show that the model can perform reasonably well on wordings unseen during training.

Section: Introduction

mentioning

confidence: 99%

“…End-to-end SLU architecture
Train (utterances, speakers): (115660, 77)
Validation (utterances, speakers): (3118, 10)
Test (utterances, speakers): (3793, 10)
Unique intents: 31
Unique (actions, objects, locations): (6, 14, 4) …”

mentioning

confidence: 99%

End-to-End Architectures for ASR-Free Spoken Language Understanding

Palogiannidi, Gkinis, Mastrapas et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

“…Nowadays there is a growing research interest in end-to-end systems for various SLU tasks [23][24][25][26][27][28][29][30][31]. In this work, similarly to [26,29], end-to-end training of signal-to-concept models is performed through the recurrent neural network (RNN) architecture and the connectionist temporal classification (CTC) loss function [32] as shown in Figure 1.…”

Section: End-to-end Signal-to-concept Neural Architecture

mentioning

confidence: 99%

Dialogue History Integration into End-to-End Signal-to-Concept Spoken Language Understanding Systems

Tomashenko, Raymond, Caubrière et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This work investigates embeddings for representing dialog history in spoken language understanding (SLU) systems. We focus on the scenario in which the semantic information is extracted directly from the speech signal by means of a single end-to-end neural network model. We propose to integrate dialogue history into an end-to-end signal-to-concept SLU system. The dialog history is represented in the form of dialog history embedding vectors (so-called h-vectors) and is provided as additional information to end-to-end SLU models in order to improve system performance. The following three types of h-vectors are proposed and experimentally evaluated in this paper: (1) supervised-all embeddings, which predict the bag-of-concepts expected in the user's answer from the last dialog system response; (2) supervised-freq embeddings, which focus on predicting only a selected set of semantic concepts (corresponding to the most frequent errors in our experiments); and (3) unsupervised embeddings. Experiments on the MEDIA corpus for the semantic slot filling task demonstrate that the proposed h-vectors improve model performance. Index Terms: end-to-end models, spoken language understanding (SLU), dialog history, h-vectors, semantic slot filling (SF)

“…The use of end-to-end models for spoken language understanding (SLU) is beginning to be given more serious consideration [1][2][3][4]. Whereas conventional SLU uses an automatic speech recognition (ASR) component to transcribe the audio into text and a natural language understanding (NLU) component to map the text to semantics, an end-to-end model maps the audio directly to the semantics [5][6][7]. End-to-end models have several advantages over the conventional SLU setup: they have reduced computational requirements and software implementation complexity, avoid downstream errors due to incorrect transcripts, can have the entire set of model parameters optimized for the ultimate performance criterion (semantic accuracy) as opposed to a surrogate criterion (word error rate), and can take advantage of information present in the speech signal but not in the transcript, such as prosody.…”

Section: Introduction

mentioning

confidence: 99%

Using Speech Synthesis to Train End-To-End Spoken Language Understanding Models

Lugosch, Meyer, Nowrouzezahrai et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

End-to-end models are an attractive new approach to spoken language understanding (SLU) in which the meaning of an utterance is inferred directly from the raw audio without employing the standard pipeline composed of a separately trained speech recognizer and natural language understanding module. The downside of end-to-end SLU is that in-domain speech data must be recorded to train the model. In this paper, we propose a strategy for overcoming this requirement in which speech synthesis is used to generate a large synthetic training dataset from several artificial speakers. Experiments on two open-source SLU datasets confirm the effectiveness of our approach, both as a sole source of training data and as a form of data augmentation. Index Terms: spoken language understanding, speech synthesis, speech recognition, end-to-end spoken language understanding, backtranslation.
