2019
DOI: 10.48550/arxiv.1904.03670
Preprint

Speech Model Pre-training for End-to-End Spoken Language Understanding

Cited by 39 publications (134 citation statements)
References 0 publications
“…DeepSpeech is a character-level model whose softmax outputs over the model vocabulary were used as inputs to the intent classification model [3]. Similarly, softmax outputs of an English phoneme recognition system [4] have also been used to build intent recognition systems for Sinhala and Tamil [5].…”
Section: Related Work
confidence: 99%
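The pipeline described in this citation — per-frame softmax posteriors from an ASR model fed into a separate intent classifier — can be sketched minimally as follows. This is a hypothetical illustration, not the cited systems' actual code: the vocabulary size, intent count, and mean-pooling-plus-linear classifier are all stand-in assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, VOCAB_SIZE, NUM_INTENTS = 100, 40, 31  # illustrative sizes

# Per-frame posteriors as an ASR model would emit: (time, vocab).
# Here random numbers stand in for real acoustic-model outputs.
posteriors = softmax(rng.normal(size=(T, VOCAB_SIZE)))

# A toy intent classifier on top: pool posteriors over time,
# then apply a linear layer followed by a softmax.
W = rng.normal(size=(VOCAB_SIZE, NUM_INTENTS)) * 0.1
features = posteriors.mean(axis=0)       # (vocab,)
intent_scores = softmax(features @ W)    # (num_intents,)

print(intent_scores.shape)  # (31,)
```

The key property of such cascades is that the intent classifier never sees the raw audio, only the ASR model's posterior distribution over its output vocabulary.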
“…The complete statistics are shown in Table 1. For English, we use the largest freely available Fluent Speech Commands (FSC) dataset [4]. The dataset has 248 unique sentences spoken by 97 speakers.…”
Section: Dataset
confidence: 99%
“…Many benchmark datasets are created to facilitate Spoken Language Understanding (SLU) [41,42,5,43,26,44], which evaluate the robustness of the downstream NLU model against the error output from the upstream acoustic model [7,8,9,10]. However, they are only designed for a particular domain or a specific task such as intent detection and slot filling.…”
Section: Related Work
confidence: 99%
“…Additionally, since these two models are trained independently, the primary metric of interest (intent classification accuracy) cannot be directly optimized. Due to this problem, end-to-end (E2E) SLU models that directly map a speech signal input to an SLU output have become popular [5]- [10].…”
Section: Introduction
confidence: 99%
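The end-to-end mapping described above — one network from the speech signal straight to the SLU output, so the intent objective can be optimized directly — can be sketched with a toy model. Everything here is an illustrative assumption (random weights in place of a trained network, MFCC-like input features, mean pooling):

```python
import numpy as np

rng = np.random.default_rng(2)
T, FEAT, HIDDEN, NUM_INTENTS = 100, 13, 32, 31  # illustrative sizes

# Two small random projections stand in for a trained E2E network.
W1 = rng.normal(size=(FEAT, HIDDEN)) * 0.1
W2 = rng.normal(size=(HIDDEN, NUM_INTENTS)) * 0.1

def e2e_intent_logits(mfccs):
    """Map acoustic features (time, feat) straight to intent logits,
    with no intermediate transcript or separate ASR stage."""
    h = np.tanh(mfccs @ W1).mean(axis=0)  # pooled utterance embedding
    return h @ W2                          # (num_intents,)

logits = e2e_intent_logits(rng.normal(size=(T, FEAT)))
print(logits.shape)  # (31,)
```

Because there is no transcript bottleneck, a cross-entropy loss on these logits backpropagates through the whole model, which is what lets E2E SLU optimize intent accuracy directly.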
“…This can be addressed by using pre-training to reduce the amount of training data required. For example, researchers have pre-trained models on large ASR datasets such as LibriSpeech [10], [11] to relax audio data requirements, and have used pre-trained BERT networks [12]–[15] to relax text data requirements.…”
Section: Introduction
confidence: 99%
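The pre-train/fine-tune recipe this citation refers to — train an encoder on a large ASR corpus, then reuse it (frozen or lightly tuned) while a small intent head is trained on limited SLU data — can be sketched as follows. This is a hypothetical toy, not the paper's method: a fixed random projection stands in for the "pre-trained" encoder, and only the linear intent head is updated.

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_INTENTS = 31  # illustrative

# Stand-in "pre-trained" encoder: a fixed projection of acoustic
# features (e.g. 13 MFCCs -> 64-dim embedding), kept frozen.
W_enc = rng.normal(size=(13, 64)) * 0.1

def encode(mfccs):
    # mfccs: (time, 13); frozen encoder, mean-pooled over time
    return np.tanh(mfccs @ W_enc).mean(axis=0)

# Fine-tuning stage: only the intent head W_head is trainable.
W_head = np.zeros((64, NUM_INTENTS))

def train_step(mfccs, intent_id, lr=0.1):
    """One cross-entropy gradient step on the head; encoder untouched."""
    global W_head
    h = encode(mfccs)
    logits = h @ W_head
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = np.outer(h, probs)      # dL/dW for all classes ...
    grad[:, intent_id] -= h        # ... minus the one-hot target term
    W_head -= lr * grad

x = rng.normal(size=(50, 13))      # one fake utterance
before = (encode(x) @ W_head)[3]   # logit of class 3 before tuning
for _ in range(20):
    train_step(x, intent_id=3)
after = (encode(x) @ W_head)[3]

print(after > before)  # True: the head learns while the encoder is frozen
```

Freezing the encoder is the simplest variant; in practice the pre-trained layers are often unfrozen gradually or fine-tuned with a lower learning rate.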