Interspeech 2020
DOI: 10.21437/interspeech.2020-1246

Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation

Abstract: Speech is one of the most effective means of communication and is full of information that helps convey the utterer's thoughts. However, mainly due to the cumbersome processing of acoustic features, the phoneme or word posterior probability has frequently been discarded in natural language understanding. Thus, some recent spoken language understanding (SLU) modules have utilized an end-to-end structure that preserves the uncertainty information. This further reduces the propagation of speech recogniti…
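The abstract argues for keeping ASR uncertainty (phoneme or word posterior probabilities) instead of collapsing it to a hard transcript before language understanding. As a rough illustration only, and not the paper's actual architecture (which is not shown on this page), the toy classifier below consumes per-frame phoneme posterior distributions directly; the phoneme count, intent count, and layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class PosteriorIntentClassifier(nn.Module):
    """Toy SLU head that keeps ASR uncertainty by reading per-frame
    phoneme posterior distributions instead of a hard transcript."""
    def __init__(self, num_phonemes=70, hidden=256, num_intents=31):
        super().__init__()
        self.encoder = nn.GRU(num_phonemes, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_intents)

    def forward(self, posteriors):            # (batch, frames, num_phonemes)
        _, last_hidden = self.encoder(posteriors)
        return self.head(last_hidden[-1])     # (batch, num_intents) logits

# Dummy input: a softmax over the phoneme inventory at every frame.
posteriors = torch.softmax(torch.randn(2, 100, 70), dim=-1)
intent_logits = PosteriorIntentClassifier()(posteriors)
```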

Cited by 17 publications (8 citation statements) · References 29 publications

“…Data Shortage Scenario. To examine the robustness of model performance to varying training data size, we test our model with a small amount of data, as presented in [2,5]. In Table 1, we observe a comparatively marginal performance degradation in ST-BERT for both data shortage scenarios.…”
Section: Results · Citation type: mentioning · Confidence: 99%
“…Chung et al [17] learn audio segment representations and word representations individually and align their spaces via adversarial training. Other works [5,6] match sequence-level representations of the two modalities using knowledge distillation [18] from a text encoder to a speech encoder. Speech-BERT [19] jointly trains multi-modal representations.…”
Section: Cross-modal Representation Learning · Citation type: mentioning · Confidence: 99%
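The works cited as [5,6] above are described as matching sequence-level representations across modalities via knowledge distillation from a text encoder to a speech encoder. Below is a minimal sketch of that idea, assuming pooled utterance- and sentence-level embeddings of equal dimensionality and an MSE matching loss; the actual pooling and objective in those papers may differ.

```python
import torch
import torch.nn.functional as F

def sequence_level_kd_loss(speech_repr, text_repr):
    """speech_repr, text_repr: (batch, dim) pooled sequence embeddings.
    The text side is detached so gradients only update the speech encoder."""
    return F.mse_loss(speech_repr, text_repr.detach())

speech_repr = torch.randn(4, 768, requires_grad=True)  # student (speech encoder) output
text_repr = torch.randn(4, 768)                        # frozen teacher (text encoder) output
loss = sequence_level_kd_loss(speech_repr, text_repr)
loss.backward()
```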
“…Several works use a cross-modal distillation approach for SLU [13,14] to exploit textual knowledge. Cho et al [13] use knowledge distillation from a fine-tuned text BERT to an SLU model by pulling the two models' predicted intent-classification logits close to each other during fine-tuning. Denisov and Vu [14] match an utterance embedding and a sentence embedding of ASR pairs using knowledge distillation as a pre-training step.…”
Section: Knowledge Distillation for SLU · Citation type: mentioning · Confidence: 99%
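The statement above describes [13] as matching predicted intent logits between a fine-tuned text BERT teacher and an SLU student. A standard soft-target distillation loss in that spirit is sketched below; the temperature, loss weighting, and exact objective are illustrative assumptions rather than values taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft KL term on temperature-scaled logits with the usual
    hard-label cross-entropy on the intent labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example with random logits over 31 intent classes.
student = torch.randn(4, 31, requires_grad=True)
teacher = torch.randn(4, 31)
labels = torch.randint(0, 31, (4,))
loss = distillation_loss(student, teacher, labels)
```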
“…Test accuracy on FSC almost reaches 100%, implying that there is little room to improve and to evaluate a newly proposed method's effectiveness. Following [6,13], we simulate a data shortage scenario using only 10% of the speech-text pairs in training. We randomly divide the FSC dataset into ten parts and report the average accuracy over them.…”
Section: Dataset · Citation type: mentioning · Confidence: 99%
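The 10% data-shortage protocol quoted above (split FSC into ten parts, train on each part alone, average the accuracies) can be sketched as follows; train_and_evaluate is a hypothetical helper standing in for the citing paper's full training and test loop, and the seed is arbitrary.

```python
import random

def data_shortage_accuracy(pairs, train_and_evaluate, num_parts=10, seed=0):
    """Shuffle the speech-text pairs, split them into `num_parts` disjoint
    subsets, train/evaluate on each subset alone, and average the accuracies."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    parts = [shuffled[i::num_parts] for i in range(num_parts)]
    accuracies = [train_and_evaluate(part) for part in parts]
    return sum(accuracies) / len(accuracies)
```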