Interspeech 2019
DOI: 10.21437/interspeech.2019-2396
Speech Model Pre-Training for End-to-End Spoken Language Understanding

Abstract: Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU. We introduce a new SLU dataset, Flue…
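The pre-training strategy sketched in the abstract (train the encoder on phoneme and word targets first, then fine-tune the same encoder for intent classification) can be illustrated with a minimal PyTorch sketch. All module names, layer sizes, and label counts below are illustrative assumptions, not the paper's released implementation:

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Acoustic encoder pre-trained to predict phonemes and words.

    Hypothetical sketch: a plain bidirectional GRU over log-Mel
    features stands in for the paper's actual architecture.
    """
    def __init__(self, n_mels=40, hidden=256, n_phonemes=42, n_words=10000):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        # Auxiliary heads used only during the pre-training stage.
        self.phoneme_head = nn.Linear(2 * hidden, n_phonemes)
        self.word_head = nn.Linear(2 * hidden, n_words)

    def forward(self, feats):                 # feats: (B, T, n_mels)
        h, _ = self.rnn(feats)                # h: (B, T, 2*hidden)
        return h

class IntentClassifier(nn.Module):
    """SLU head fine-tuned on top of the pre-trained encoder."""
    def __init__(self, encoder, hidden=256, n_intents=31):
        super().__init__()
        self.encoder = encoder
        self.out = nn.Linear(2 * hidden, n_intents)

    def forward(self, feats):
        h = self.encoder(feats)
        pooled = h.mean(dim=1)                # average over time
        return self.out(pooled)

# Stage 1 (pre-training, not shown): fit phoneme_head / word_head on
# ASR-style targets. Stage 2: reuse the encoder for intent labels.
enc = SpeechEncoder()
model = IntentClassifier(enc)
logits = model(torch.randn(8, 200, 40))       # dummy batch
print(logits.shape)                           # torch.Size([8, 31])
```

The point of the two auxiliary heads is that word- and phoneme-level supervision forces the encoder to learn transferable acoustic features before it ever sees an intent label, which is what reduces the SLU data requirement.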

Cited by 192 publications (280 citation statements); references 32 publications.
“…With nearly 100 speakers in the dataset, the model needs enough examples to distinguish among all those speakers and identify the right one. Finally, the accuracy results of the multitask model in the train-and-test experiment of [16] are higher than the results of the model (without pre-training) proposed there.…”
Section: Discussion (mentioning)
confidence: 67%
“…Using the accuracy metric as defined in that paper, the multitask model achieved an accuracy of 97.8% on the test set after training on the partial dataset and 98.1% after training on the full dataset. These results should be compared with the model of [16] without pre-training, which reaches an accuracy of 88.9% with the partial dataset and 96.6% with the full dataset.…”
Section: Methods (mentioning)
confidence: 99%
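The accuracy metric referred to here is utterance-level: a prediction typically counts as correct only if every semantic slot of the utterance is right. A minimal sketch of that metric, assuming (hypothetically) that predictions and references are dicts of slot values with illustrative slot names:

```python
def utterance_accuracy(predictions, references):
    """Fraction of utterances whose slots are ALL predicted correctly.

    predictions / references: lists of dicts mapping slot name -> value,
    e.g. {"action": "activate", "object": "lights", "location": "kitchen"}.
    Slot names here are illustrative, not taken from the paper.
    """
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)

preds = [{"action": "activate", "object": "lights", "location": "kitchen"},
         {"action": "deactivate", "object": "music", "location": "none"}]
refs  = [{"action": "activate", "object": "lights", "location": "kitchen"},
         {"action": "deactivate", "object": "lamp", "location": "none"}]
print(utterance_accuracy(preds, refs))  # 0.5: second utterance has one wrong slot
```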
“…There is growing research interest in end-to-end systems for various SLU tasks [23][24][25][26][27][28][29][30][31]. In this work, similarly to [26,29], end-to-end training of signal-to-concept models is performed with a recurrent neural network (RNN) architecture and the connectionist temporal classification (CTC) loss function [32], as shown in Figure 1.…”
Section: End-to-end Signal-to-concept Neural Architecture (mentioning)
confidence: 99%
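A signal-to-concept model of the kind this excerpt describes (an RNN over acoustic features trained with the CTC loss against concept/word sequences) can be sketched with PyTorch's built-in nn.CTCLoss; all sizes and the vocabulary below are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Minimal sketch of end-to-end signal-to-concept training with CTC;
# dimensions and vocabulary size are illustrative assumptions.
n_mels, hidden, n_symbols = 40, 256, 100   # symbol 0 reserved for CTC blank

rnn = nn.GRU(n_mels, hidden, num_layers=3, batch_first=True, bidirectional=True)
proj = nn.Linear(2 * hidden, n_symbols)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 300, n_mels)        # (batch, frames, features)
h, _ = rnn(feats)
log_probs = proj(h).log_softmax(-1)        # (batch, frames, symbols)

# nn.CTCLoss expects (T, B, C) log-probs plus target and length tensors.
targets = torch.randint(1, n_symbols, (4, 12))   # concept/word label sequences
input_lengths = torch.full((4,), 300, dtype=torch.long)
target_lengths = torch.full((4,), 12, dtype=torch.long)

loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))
```

CTC is a natural fit for this setup because it requires no frame-level alignment between the audio and the concept sequence; the blank symbol absorbs the unaligned frames.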
“…Most of the recently proposed end-to-end models are based on sequence-to-sequence architectures. They were initially applied to speech translation [6,7] and then to SLU tasks, where the main goal is to extract the domain and user intent from an utterance, together with some semantic slots [2,5].…”
Section: Introduction (mentioning)
confidence: 99%