ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9414900
Speech-Language Pre-Training for End-to-End Spoken Language Understanding

Abstract: End-to-end (E2E) spoken language understanding (SLU) can infer semantics directly from the speech signal without cascading an automatic speech recognizer (ASR) with a natural language understanding (NLU) module. However, paired utterance recordings and corresponding semantics may not always be available or sufficient to train an E2E SLU model in a real production environment. In this paper, we propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder…
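Since the abstract is truncated, the following is a minimal, hypothetical PyTorch sketch of how such an encoder-encoder-decoder SLU model could be wired together, with a transformer decoder cross-attending over the concatenated outputs of a speech encoder and a text encoder. The module names, dimensions, concatenation-based fusion, and intent head are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the paper's code): unifying a speech encoder and a
# text encoder under a single transformer decoder for E2E SLU.
import torch
import torch.nn as nn

class SpeechLanguageSLU(nn.Module):
    def __init__(self, speech_encoder: nn.Module, text_encoder: nn.Module,
                 d_model: int = 768, vocab_size: int = 30522,
                 num_intents: int = 31, decoder_layers: int = 6):
        super().__init__()
        self.speech_encoder = speech_encoder   # e.g. a pre-trained E2E ASR encoder
        self.text_encoder = text_encoder       # e.g. a BERT-style LM encoder
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=decoder_layers)
        self.intent_head = nn.Linear(d_model, num_intents)

    def forward(self, speech_feats, text_ids, decoder_input_ids):
        # Encode each modality separately, then let the decoder cross-attend
        # over the concatenated memory (one simple way to "unify" the encoders).
        speech_mem = self.speech_encoder(speech_feats)      # (B, T_s, d_model)
        text_mem = self.text_encoder(text_ids)              # (B, T_t, d_model)
        memory = torch.cat([speech_mem, text_mem], dim=1)   # (B, T_s+T_t, d_model)

        tgt = self.token_embed(decoder_input_ids)           # (B, T_d, d_model)
        hidden = self.decoder(tgt, memory)                  # (B, T_d, d_model)
        # Predict an utterance-level intent from the first decoder position.
        return self.intent_head(hidden[:, 0, :])
```

In the system the abstract describes, both branches would be initialized from pre-trained models (a well-optimized E2E ASR encoder and a pre-trained language model encoder), and the unified model would then be fine-tuned on paired SLU data.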

Cited by 16 publications (6 citation statements)
References 28 publications
“…Again, fine-tuning time and cost dominate the SLU model cost, but fine-tuning is intended to be done once and for all, producing resources that will be made available so that the process does not have to be repeated. Additionally, while we fine-tune a model of 315M parameters on FSC data only, that is, 14.7 hours of speech, a state-of-the-art model such as [31] pre-trains ASR and BERT-base models, roughly 287M parameters, on 75k hours of speech, then uses these models as components in the final SLU system, which is also fine-tuned on the FSC data.…”
Section: Results on English FSC
confidence: 99%
“…Kim et al. (2021) learn multi-modal alignment with two cross-modal pre-training tasks, masked language modeling and conditioned language modeling. Qian et al. (2021) unify a pre-trained ASR encoder for speech and a pre-trained language model encoder for text into a transformer decoder. Sato et al. (2022) introduce an adaptation branch to embed acoustic and linguistic information in the same latent space.…”
Section: Related Work
confidence: 99%
“…This model is evaluated for its robustness against ASR errors and its ability to extract semantic meaning from the input sequence. Qian et al. proposed to integrate an end-to-end ASR encoder and a pre-trained language model encoder into a transformer decoder for the SLU task [26].…”
Section: ASR-SLU-Based Intent Classification
confidence: 99%