Interspeech 2020
DOI: 10.21437/interspeech.2020-2456

Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning

Abstract: A number of methods have been proposed for End-to-End Spoken Language Understanding (E2E-SLU) using pretrained models; however, their evaluation often lacks a multilingual setup and tasks that require prediction of lexical fillers, such as slot filling. In this work, we propose a unified method that integrates multilingual pretrained speech and text models and performs E2E-SLU on six datasets in four languages in a generative manner, including the prediction of lexical fillers. We investigate how the proposed met…

Cited by 24 publications (27 citation statements). References 21 publications.
“…Chung et al [17] learn audio segment representations and word representations individually and align their spaces via adversarial training. Other works [5,6] match sequence-level representations of the two modalities using knowledge distillation [18] from a text encoder to a speech encoder. Speech-BERT [19] jointly trains multi-modal representations.…”
Section: Cross-modal Representation Learning (mentioning, confidence: 99%)
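The sequence-level knowledge distillation from a text encoder to a speech encoder mentioned above [5,6] can be pictured with a short PyTorch sketch. The encoder interfaces, the mean pooling, the projection layer, and the MSE objective below are illustrative assumptions rather than the exact recipe of any cited work.

```python
# A minimal sketch of cross-modal teacher-student matching, assuming a frozen
# text encoder (teacher) and a trainable speech encoder (student) that both
# return hidden states of shape (batch, time, dim). Pooling, projection and
# the MSE objective are assumptions for illustration only.

import torch
import torch.nn as nn


class CrossModalDistiller(nn.Module):
    def __init__(self, speech_encoder: nn.Module, text_encoder: nn.Module, teacher_dim: int = 768):
        super().__init__()
        self.speech_encoder = speech_encoder        # student: frames -> (B, T_s, D_s)
        self.text_encoder = text_encoder            # teacher: token ids -> (B, T_t, teacher_dim)
        for p in self.text_encoder.parameters():    # keep the teacher frozen
            p.requires_grad = False
        self.proj = nn.LazyLinear(teacher_dim)      # map the student space onto the teacher space

    def forward(self, speech_feats, speech_mask, token_ids, token_mask):
        # speech_mask / token_mask: float masks with 1.0 on real positions, 0.0 on padding.
        # Student: frame-level hidden states, mean-pooled into an utterance embedding.
        s_hidden = self.speech_encoder(speech_feats)                                   # (B, T_s, D_s)
        s_utt = (s_hidden * speech_mask.unsqueeze(-1)).sum(1) / speech_mask.sum(1, keepdim=True)
        s_utt = self.proj(s_utt)                                                       # (B, teacher_dim)

        # Teacher: sentence embedding of the reference transcript, no gradients.
        with torch.no_grad():
            t_hidden = self.text_encoder(token_ids)                                    # (B, T_t, teacher_dim)
            t_utt = (t_hidden * token_mask.unsqueeze(-1)).sum(1) / token_mask.sum(1, keepdim=True)

        # Distillation objective: pull the two utterance embeddings together.
        return nn.functional.mse_loss(s_utt, t_utt)
```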
“…This allows the models to fully exploit additional information, such as emotion and nuance, carried by the acoustic signal. Recently, leveraging large-scale pre-trained language models (PLMs) such as BERT [4] has enhanced SLU performance [5,6] by benefiting from richly learned textual representations. However, these methods exploit only limited textual information by explicitly aligning the spoken utterance and its transcript representations.…”
Section: Introduction (mentioning, confidence: 99%)
“…Speech synthesis has thus been explored to generate large amounts of training data from multiple artificial speakers, compensating for the shortage of paired data in E2E SLU [13,20]. The work most similar to ours is presented in [13,21], which initializes a speech-to-intent (S2I) model with an ASR model and improves it by leveraging BERT and text-to-speech-augmented S2I data. The significant difference is that they employ sentence-level embeddings from both speech and language to jointly classify the intent, while we use an attention-based autoregressive generation model, jointly trained on frame-level speech embeddings and token-level text embeddings, to generate multiple intents and slots/values given an utterance input.…”
Section: Related Work (mentioning, confidence: 99%)
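The attention-based autoregressive formulation described in this statement can be sketched as a transformer decoder that cross-attends over frame-level speech embeddings and emits intents and slot/value pairs as one flat token sequence. The vocabulary layout, model sizes, and greedy decoding loop below are illustrative assumptions, not the cited authors' exact architecture.

```python
# A minimal sketch of an attention-based autoregressive generator over
# frame-level speech embeddings. It emits intents and slot/value pairs as one
# flat token sequence (e.g. "<intent> set_alarm <slot> time = 7 am <eos>").
# The speech encoder and positional encodings are omitted for brevity.

import torch
import torch.nn as nn


class SpeechToSemanticsDecoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, prev_tokens, speech_memory):
        # prev_tokens:   (B, T_out) target tokens so far (teacher forcing at training time)
        # speech_memory: (B, T_frames, d_model) frame-level speech encoder outputs
        T = prev_tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=prev_tokens.device), diagonal=1)
        h = self.decoder(self.embed(prev_tokens), speech_memory, tgt_mask=causal)
        return self.out(h)                           # (B, T_out, vocab_size) next-token logits

    @torch.no_grad()
    def greedy_decode(self, speech_memory, bos_id: int, eos_id: int, max_len: int = 64):
        # Autoregressively generate the intent/slot token sequence for each utterance.
        tokens = torch.full((speech_memory.size(0), 1), bos_id,
                            dtype=torch.long, device=speech_memory.device)
        for _ in range(max_len):
            next_id = self.forward(tokens, speech_memory)[:, -1].argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_id], dim=1)
            if (next_id == eos_id).all():
                break
        return tokens
```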
“…Several works apply a cross-modal distillation approach to SLU [13,14] to exploit textual knowledge. Cho et al [13] use knowledge distillation from a fine-tuned text BERT to an SLU model by making the two models' predicted logits for intent classification close to each other during fine-tuning.…”
Section: Knowledge Distillation For SLU (mentioning, confidence: 99%)
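The logit-level distillation attributed to Cho et al [13] above amounts to a temperature-scaled KL term between teacher and student intent logits alongside the usual cross-entropy on the gold label. The temperature and mixing weight in this sketch are illustrative assumptions.

```python
# A minimal sketch of logit-level distillation for intent classification,
# assuming the teacher is a fine-tuned text BERT and the student is the SLU
# model; temperature and mixing weight are illustrative.

import torch.nn.functional as F


def intent_distillation_loss(student_logits, teacher_logits, gold_intent,
                             temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets from the teacher, softened by the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Ordinary supervised loss on the gold intent label.
    ce = F.cross_entropy(student_logits, gold_intent)
    return alpha * kd + (1.0 - alpha) * ce
```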
“…Denisov and Vu [14] match an utterance embedding and a sentence embedding of ASR pairs using knowledge distillation as a pre-training step. Compared to them, we perform knowledge distillation in both pre-training and fine-tuning, meaning that we match the sequence-level hidden representations and the predicted logits of the two modalities.…”
Section: Knowledge Distillation For SLU (mentioning, confidence: 99%)
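The combination described in the last sentence, matching both hidden representations and predicted logits, can be written as a single weighted objective. The loss weights, and the assumption that the speech-side and text-side hidden states have already been brought to a common length and dimension, are illustrative, not the cited paper's exact formulation.

```python
# A minimal sketch of combining representation matching and logit matching,
# assuming aligned hidden-state shapes for the two modalities; weights and
# temperature are illustrative.

import torch.nn.functional as F


def joint_distillation_loss(student_hidden, teacher_hidden,
                            student_logits, teacher_logits, gold_labels,
                            w_repr: float = 1.0, w_logit: float = 1.0, w_task: float = 1.0,
                            temperature: float = 2.0):
    repr_loss = F.mse_loss(student_hidden, teacher_hidden)             # hidden-state matching
    logit_loss = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                          F.softmax(teacher_logits / temperature, dim=-1),
                          reduction="batchmean") * temperature ** 2    # logit matching
    task_loss = F.cross_entropy(student_logits, gold_labels)           # supervised objective
    return w_repr * repr_loss + w_logit * logit_loss + w_task * task_loss
```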