2019
DOI: 10.48550/arxiv.1910.10599
Preprint

End-to-end architectures for ASR-free spoken language understanding

Cited by 2 publications (2 citation statements) | References 0 publications
“…There has been some work on improving intent classification by utilizing a novel architecture: [13] replaced the softmax classifier with a capsule network and showed that it can make efficient use of limited training data. However, their model is a speaker-dependent system and makes use of pre-defined speech commands; [14] Since our main focus in this paper is on the learning algorithm rather than the model architecture, we adopt a simple encoder-decoder architecture similar to that in [4] and [9], illustrated in Figure 1. The choice of a simple architecture also ensures that, when comparing our models with SotA results (see Section 5), the relative gain in intent prediction accuracy comes from the training strategy rather than a more advanced architecture.…”
Section: Modeling End-to-end SLU
confidence: 99%
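The statement above describes a deliberately simple encoder-plus-classifier design for end-to-end SLU, so that accuracy gains can be attributed to the training strategy. The sketch below is purely illustrative, not the architecture from the cited papers: it stands in for the encoder with a single projection plus mean-pooling over frames, and for the decoder with one softmax layer over intents; all dimensions are made-up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 40-dim acoustic features per frame,
# 64-dim encoder state, 31 intent classes.
FEAT_DIM, HID_DIM, N_INTENTS = 40, 64, 31

# "Encoder": a per-frame projection followed by mean-pooling over time,
# a stand-in for the recurrent/convolutional encoders used in practice.
W_enc = rng.normal(0.0, 0.1, (FEAT_DIM, HID_DIM))

# "Decoder"/classifier: a single softmax layer over intents.
W_out = rng.normal(0.0, 0.1, (HID_DIM, N_INTENTS))

def predict_intent(frames: np.ndarray) -> int:
    """frames: (T, FEAT_DIM) acoustic features for one utterance."""
    hidden = np.tanh(frames @ W_enc)   # (T, HID_DIM) frame encodings
    pooled = hidden.mean(axis=0)       # utterance-level embedding
    logits = pooled @ W_out            # intent scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()               # softmax over intents
    return int(probs.argmax())

utterance = rng.normal(0.0, 1.0, (120, FEAT_DIM))  # 120 frames of dummy features
print(predict_intent(utterance))
```

The point of such a minimal pipeline, as the quoted passage argues, is that any comparison against stronger baselines isolates the contribution of the learning algorithm rather than the network design.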
“…Following the previous end-to-end SLU papers [4,5,24], we use the Fluent Speech Commands (FSC) dataset proposed in [4]. It incorporates 30,874 speech utterances annotated with three slots, namely action, object, and location.…”
Section: Dataset
confidence: 99%
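In the FSC annotation scheme described above, each utterance carries an (action, object, location) slot triple, and the intent is the triple itself. A minimal sketch of mapping such a triple to a single class index, using a few illustrative slot values rather than the full FSC inventory:

```python
from itertools import product

# Illustrative slot vocabularies; NOT the complete FSC value sets.
actions = ["activate", "deactivate", "increase"]
objects = ["lights", "music", "heat"]
locations = ["none", "kitchen", "bedroom"]

def intent_id(action: str, obj: str, location: str) -> int:
    """Map an (action, object, location) triple to a single intent index."""
    triples = list(product(actions, objects, locations))
    return triples.index((action, obj, location))

print(intent_id("activate", "lights", "kitchen"))  # → 1
```

Treating the triple as one joint label is what lets a plain classifier head cover the dataset's intents, at the cost of not sharing parameters across slots.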