2019
DOI: 10.48550/arxiv.1910.10599
Preprint

End-to-end architectures for ASR-free spoken language understanding

Cited by 2 publications (2 citation statements) | References 0 publications
“…There has been some work on improving intent classification by utilizing a novel architecture: [13] replaced the softmax classifier with a capsule network and showed that it can make efficient use of limited training data. However, their model is a speaker-dependent system and makes use of pre-defined speech commands; [14] Since our main focus in this paper is on the learning algorithm rather than the model architecture, we adopt a simple encoder-decoder architecture similar to that in [4] and [9], illustrated in Figure 1. The choice of a simple architecture also ensures that, when comparing our models with SotA results (see Section 5), the relative gain in intent prediction accuracy comes from the training strategy rather than a more advanced architecture.…”
Section: Modeling End-to-end SLU
confidence: 99%
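The statement above describes a deliberately simple encoder-plus-classifier design for end-to-end SLU, so that accuracy gains can be attributed to the training strategy. The sketch below is purely illustrative, not the architecture from the cited papers: it stands in for the encoder with a single projection plus mean-pooling over frames, and for the decoder with one softmax layer over intents; all dimensions are made-up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 40-dim acoustic features per frame,
# 64-dim encoder state, 31 intent classes.
FEAT_DIM, HID_DIM, N_INTENTS = 40, 64, 31

# "Encoder": a per-frame projection followed by mean-pooling over time,
# a stand-in for the recurrent/convolutional encoders used in practice.
W_enc = rng.normal(0.0, 0.1, (FEAT_DIM, HID_DIM))

# "Decoder"/classifier: a single softmax layer over intents.
W_out = rng.normal(0.0, 0.1, (HID_DIM, N_INTENTS))

def predict_intent(frames: np.ndarray) -> int:
    """frames: (T, FEAT_DIM) acoustic features for one utterance."""
    hidden = np.tanh(frames @ W_enc)   # (T, HID_DIM) frame encodings
    pooled = hidden.mean(axis=0)       # utterance-level embedding
    logits = pooled @ W_out            # intent scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()               # softmax over intents
    return int(probs.argmax())

utterance = rng.normal(0.0, 1.0, (120, FEAT_DIM))  # 120 frames of dummy features
print(predict_intent(utterance))
```

The point of such a minimal pipeline, as the quoted passage argues, is that any comparison against stronger baselines isolates the contribution of the learning algorithm rather than the network design.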
“…Following the previous end-to-end SLU papers [4,5,24], we use the Fluent Speech Commands (FSC) dataset proposed in [4]. It incorporates 30,874 speech utterances annotated with three slots, namely action, object, and location.…”
Section: Dataset
confidence: 99%
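In the FSC annotation scheme described above, each utterance carries an (action, object, location) slot triple, and the intent is the triple itself. A minimal sketch of mapping such a triple to a single class index, using a few illustrative slot values rather than the full FSC inventory:

```python
from itertools import product

# Illustrative slot vocabularies; NOT the complete FSC value sets.
actions = ["activate", "deactivate", "increase"]
objects = ["lights", "music", "heat"]
locations = ["none", "kitchen", "bedroom"]

def intent_id(action: str, obj: str, location: str) -> int:
    """Map an (action, object, location) triple to a single intent index."""
    triples = list(product(actions, objects, locations))
    return triples.index((action, obj, location))

print(intent_id("activate", "lights", "kitchen"))  # → 1
```

Treating the triple as one joint label is what lets a plain classifier head cover the dataset's intents, at the cost of not sharing parameters across slots.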