Interspeech 2020
DOI: 10.21437/interspeech.2020-2456

Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning

Abstract: A number of methods have been proposed for End-to-End Spoken Language Understanding (E2E-SLU) using pretrained models; however, their evaluation often lacks a multilingual setup and tasks that require prediction of lexical fillers, such as slot filling. In this work, we propose a unified method that integrates multilingual pretrained speech and text models and performs E2E-SLU on six datasets in four languages in a generative manner, including the prediction of lexical fillers. We investigate how the proposed met…

Cited by 24 publications (27 citation statements). References 21 publications.
“…Chung et al [17] learn audio segment representations and word representations individually and align their spaces via adversarial training. Other works [5,6] match sequence-level representations of the two modalities using knowledge distillation [18] from a text encoder to a speech encoder. Speech-BERT [19] jointly trains multi-modal representations.…”
Section: Cross-modal Representation Learning (mentioning, confidence: 99%)
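The sequence-level knowledge distillation from a text encoder to a speech encoder mentioned above [5,6] can be pictured with a short PyTorch sketch. The encoder interfaces, the mean pooling, the projection layer, and the MSE objective below are illustrative assumptions rather than the exact recipe of any cited work.

```python
# A minimal sketch of cross-modal teacher-student matching, assuming a frozen
# text encoder (teacher) and a trainable speech encoder (student) that both
# return hidden states of shape (batch, time, dim). Pooling, projection and
# the MSE objective are assumptions for illustration only.

import torch
import torch.nn as nn


class CrossModalDistiller(nn.Module):
    def __init__(self, speech_encoder: nn.Module, text_encoder: nn.Module, teacher_dim: int = 768):
        super().__init__()
        self.speech_encoder = speech_encoder        # student: frames -> (B, T_s, D_s)
        self.text_encoder = text_encoder            # teacher: token ids -> (B, T_t, teacher_dim)
        for p in self.text_encoder.parameters():    # keep the teacher frozen
            p.requires_grad = False
        self.proj = nn.LazyLinear(teacher_dim)      # map the student space onto the teacher space

    def forward(self, speech_feats, speech_mask, token_ids, token_mask):
        # speech_mask / token_mask: float masks with 1.0 on real positions, 0.0 on padding.
        # Student: frame-level hidden states, mean-pooled into an utterance embedding.
        s_hidden = self.speech_encoder(speech_feats)                                   # (B, T_s, D_s)
        s_utt = (s_hidden * speech_mask.unsqueeze(-1)).sum(1) / speech_mask.sum(1, keepdim=True)
        s_utt = self.proj(s_utt)                                                       # (B, teacher_dim)

        # Teacher: sentence embedding of the reference transcript, no gradients.
        with torch.no_grad():
            t_hidden = self.text_encoder(token_ids)                                    # (B, T_t, teacher_dim)
            t_utt = (t_hidden * token_mask.unsqueeze(-1)).sum(1) / token_mask.sum(1, keepdim=True)

        # Distillation objective: pull the two utterance embeddings together.
        return nn.functional.mse_loss(s_utt, t_utt)
```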
“…This allows the models to fully exploit additional information, such as emotion and nuance, carried by the acoustic signal. Recently, leveraging large-scale pre-trained language models (PLMs) such as BERT [4] has enhanced SLU performance [5,6] by benefiting from richly learned textual representations. However, these methods exploit only limited textual information by explicitly aligning the spoken utterance and its transcript representations.…”
Section: Introduction (mentioning, confidence: 99%)
“…Speech synthesis has thus been explored to generate large amounts of training data from multiple artificial speakers, compensating for the shortage of paired data in E2E SLU [13,20]. The work most similar to ours is presented in [13,21], which initializes a speech-to-intent (S2I) model with an ASR model and improves it by leveraging BERT and text-to-speech-augmented S2I data. The significant difference is that they employ sentence-level embeddings from both speech and language to jointly classify the intent, while we use an attention-based autoregressive generation model, jointly trained on frame-level speech embeddings and token-level text embeddings, to generate multiple intents and slots/values given an utterance input.…”
Section: Related Work (mentioning, confidence: 99%)
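The attention-based autoregressive formulation described in this statement can be sketched as a transformer decoder that cross-attends over frame-level speech embeddings and emits intents and slot/value pairs as one flat token sequence. The vocabulary layout, model sizes, and greedy decoding loop below are illustrative assumptions, not the cited authors' exact architecture.

```python
# A minimal sketch of an attention-based autoregressive generator over
# frame-level speech embeddings. It emits intents and slot/value pairs as one
# flat token sequence (e.g. "<intent> set_alarm <slot> time = 7 am <eos>").
# The speech encoder and positional encodings are omitted for brevity.

import torch
import torch.nn as nn


class SpeechToSemanticsDecoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, prev_tokens, speech_memory):
        # prev_tokens:   (B, T_out) target tokens so far (teacher forcing at training time)
        # speech_memory: (B, T_frames, d_model) frame-level speech encoder outputs
        T = prev_tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=prev_tokens.device), diagonal=1)
        h = self.decoder(self.embed(prev_tokens), speech_memory, tgt_mask=causal)
        return self.out(h)                           # (B, T_out, vocab_size) next-token logits

    @torch.no_grad()
    def greedy_decode(self, speech_memory, bos_id: int, eos_id: int, max_len: int = 64):
        # Autoregressively generate the intent/slot token sequence for each utterance.
        tokens = torch.full((speech_memory.size(0), 1), bos_id,
                            dtype=torch.long, device=speech_memory.device)
        for _ in range(max_len):
            next_id = self.forward(tokens, speech_memory)[:, -1].argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_id], dim=1)
            if (next_id == eos_id).all():
                break
        return tokens
```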
“…Several works apply a cross-modal distillation approach to SLU [13,14] to exploit textual knowledge. Cho et al [13] use knowledge distillation from a fine-tuned text BERT to an SLU model by making the two models' predicted logits for intent classification close to each other during fine-tuning.…”
Section: Knowledge Distillation For SLU (mentioning, confidence: 99%)
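The logit-level distillation attributed to Cho et al [13] above amounts to a temperature-scaled KL term between teacher and student intent logits alongside the usual cross-entropy on the gold label. The temperature and mixing weight in this sketch are illustrative assumptions.

```python
# A minimal sketch of logit-level distillation for intent classification,
# assuming the teacher is a fine-tuned text BERT and the student is the SLU
# model; temperature and mixing weight are illustrative.

import torch.nn.functional as F


def intent_distillation_loss(student_logits, teacher_logits, gold_intent,
                             temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets from the teacher, softened by the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Ordinary supervised loss on the gold intent label.
    ce = F.cross_entropy(student_logits, gold_intent)
    return alpha * kd + (1.0 - alpha) * ce
```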
“…Denisov and Vu [14] match an utterance embedding and a sentence embedding of ASR pairs using knowledge distillation as a pre-training step. Compared to them, we perform knowledge distillation in both pre-training and fine-tuning, meaning that we match the sequence-level hidden representations and the predicted logits of the two modalities.…”
Section: Knowledge Distillation For SLU (mentioning, confidence: 99%)
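The combination described in the last sentence, matching both hidden representations and predicted logits, can be written as a single weighted objective. The loss weights, and the assumption that the speech-side and text-side hidden states have already been brought to a common length and dimension, are illustrative, not the cited paper's exact formulation.

```python
# A minimal sketch of combining representation matching and logit matching,
# assuming aligned hidden-state shapes for the two modalities; weights and
# temperature are illustrative.

import torch.nn.functional as F


def joint_distillation_loss(student_hidden, teacher_hidden,
                            student_logits, teacher_logits, gold_labels,
                            w_repr: float = 1.0, w_logit: float = 1.0, w_task: float = 1.0,
                            temperature: float = 2.0):
    repr_loss = F.mse_loss(student_hidden, teacher_hidden)             # hidden-state matching
    logit_loss = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                          F.softmax(teacher_logits / temperature, dim=-1),
                          reduction="batchmean") * temperature ** 2    # logit matching
    task_loss = F.cross_entropy(student_logits, gold_labels)           # supervised objective
    return w_repr * repr_loss + w_logit * logit_loss + w_task * task_loss
```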