Interspeech 2019
DOI: 10.21437/interspeech.2019-2396
Speech Model Pre-Training for End-to-End Spoken Language Understanding

Abstract: Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU. We introduce a new SLU dataset, Flue…
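The pre-training strategy sketched in the abstract (train the encoder on phoneme and word targets first, then fine-tune the same encoder for intent classification) can be illustrated with a minimal PyTorch sketch. All module names, layer sizes, and label counts below are illustrative assumptions, not the paper's released implementation:

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Acoustic encoder pre-trained to predict phonemes and words.

    Hypothetical sketch: a plain bidirectional GRU over log-Mel
    features stands in for the paper's actual architecture.
    """
    def __init__(self, n_mels=40, hidden=256, n_phonemes=42, n_words=10000):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        # Auxiliary heads used only during the pre-training stage.
        self.phoneme_head = nn.Linear(2 * hidden, n_phonemes)
        self.word_head = nn.Linear(2 * hidden, n_words)

    def forward(self, feats):                 # feats: (B, T, n_mels)
        h, _ = self.rnn(feats)                # h: (B, T, 2*hidden)
        return h

class IntentClassifier(nn.Module):
    """SLU head fine-tuned on top of the pre-trained encoder."""
    def __init__(self, encoder, hidden=256, n_intents=31):
        super().__init__()
        self.encoder = encoder
        self.out = nn.Linear(2 * hidden, n_intents)

    def forward(self, feats):
        h = self.encoder(feats)
        pooled = h.mean(dim=1)                # average over time
        return self.out(pooled)

# Stage 1 (pre-training, not shown): fit phoneme_head / word_head on
# ASR-style targets. Stage 2: reuse the encoder for intent labels.
enc = SpeechEncoder()
model = IntentClassifier(enc)
logits = model(torch.randn(8, 200, 40))       # dummy batch
print(logits.shape)                           # torch.Size([8, 31])
```

The point of the two auxiliary heads is that word- and phoneme-level supervision forces the encoder to learn transferable acoustic features before it ever sees an intent label, which is what reduces the SLU data requirement.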

Cited by 192 publications (280 citation statements); references 32 publications.
“…With nearly 100 speakers in the dataset, the model needs enough examples to distinguish among all those speakers and identify the right one. Finally, the accuracy results of the multitask model in the train-and-test experiment of [16] are higher than the results of the model (without pre-training) proposed there.…”
Section: Discussion (mentioning)
confidence: 67%
“…Using the accuracy metric as defined in that paper, the multitask model achieved an accuracy of 97.8% on the test set after training on the partial dataset and 98.1% after training on the full dataset. These results should be compared with the model of [16] without pre-training, which reaches an accuracy of 88.9% with the partial dataset and 96.6% with the full dataset.…”
Section: Methods (mentioning)
confidence: 99%
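The accuracy metric referred to here is utterance-level: a prediction typically counts as correct only if every semantic slot of the utterance is right. A minimal sketch of that metric, assuming (hypothetically) that predictions and references are dicts of slot values with illustrative slot names:

```python
def utterance_accuracy(predictions, references):
    """Fraction of utterances whose slots are ALL predicted correctly.

    predictions / references: lists of dicts mapping slot name -> value,
    e.g. {"action": "activate", "object": "lights", "location": "kitchen"}.
    Slot names here are illustrative, not taken from the paper.
    """
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)

preds = [{"action": "activate", "object": "lights", "location": "kitchen"},
         {"action": "deactivate", "object": "music", "location": "none"}]
refs  = [{"action": "activate", "object": "lights", "location": "kitchen"},
         {"action": "deactivate", "object": "lamp", "location": "none"}]
print(utterance_accuracy(preds, refs))  # 0.5: second utterance has one wrong slot
```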
“…There is growing research interest in end-to-end systems for various SLU tasks [23][24][25][26][27][28][29][30][31]. In this work, similarly to [26,29], end-to-end training of signal-to-concept models is performed with a recurrent neural network (RNN) architecture and the connectionist temporal classification (CTC) loss function [32], as shown in Figure 1.…”
Section: End-to-end Signal-to-concept Neural Architecture (mentioning)
confidence: 99%
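A signal-to-concept model of the kind this excerpt describes (an RNN over acoustic features trained with the CTC loss against concept/word sequences) can be sketched with PyTorch's built-in nn.CTCLoss; all sizes and the vocabulary below are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Minimal sketch of end-to-end signal-to-concept training with CTC;
# dimensions and vocabulary size are illustrative assumptions.
n_mels, hidden, n_symbols = 40, 256, 100   # symbol 0 reserved for CTC blank

rnn = nn.GRU(n_mels, hidden, num_layers=3, batch_first=True, bidirectional=True)
proj = nn.Linear(2 * hidden, n_symbols)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 300, n_mels)        # (batch, frames, features)
h, _ = rnn(feats)
log_probs = proj(h).log_softmax(-1)        # (batch, frames, symbols)

# nn.CTCLoss expects (T, B, C) log-probs plus target and length tensors.
targets = torch.randint(1, n_symbols, (4, 12))   # concept/word label sequences
input_lengths = torch.full((4,), 300, dtype=torch.long)
target_lengths = torch.full((4,), 12, dtype=torch.long)

loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))
```

CTC is a natural fit for this setup because it requires no frame-level alignment between the audio and the concept sequence; the blank symbol absorbs the unaligned frames.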
“…Most of the recently proposed end-to-end models are based on sequence-to-sequence architectures. They were initially applied to speech translation [6,7] and then to SLU tasks, where the main goal is to extract the domain and user intent from an utterance, together with some semantic slots [2,5].…”
Section: Introduction (mentioning)
confidence: 99%