ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9414900
Speech-Language Pre-Training for End-to-End Spoken Language Understanding

Abstract: End-to-end (E2E) spoken language understanding (SLU) can infer semantics directly from the speech signal without cascading an automatic speech recognizer (ASR) with a natural language understanding (NLU) module. However, paired utterance recordings and corresponding semantics may not always be available or sufficient to train an E2E SLU model in a real production environment. In this paper, we propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder…
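Since the abstract is truncated, the following is a minimal, hypothetical PyTorch sketch of how such an encoder-encoder-decoder SLU model could be wired together, with a transformer decoder cross-attending over the concatenated outputs of a speech encoder and a text encoder. The module names, dimensions, concatenation-based fusion, and intent head are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the paper's code): unifying a speech encoder and a
# text encoder under a single transformer decoder for E2E SLU.
import torch
import torch.nn as nn

class SpeechLanguageSLU(nn.Module):
    def __init__(self, speech_encoder: nn.Module, text_encoder: nn.Module,
                 d_model: int = 768, vocab_size: int = 30522,
                 num_intents: int = 31, decoder_layers: int = 6):
        super().__init__()
        self.speech_encoder = speech_encoder   # e.g. a pre-trained E2E ASR encoder
        self.text_encoder = text_encoder       # e.g. a BERT-style LM encoder
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=decoder_layers)
        self.intent_head = nn.Linear(d_model, num_intents)

    def forward(self, speech_feats, text_ids, decoder_input_ids):
        # Encode each modality separately, then let the decoder cross-attend
        # over the concatenated memory (one simple way to "unify" the encoders).
        speech_mem = self.speech_encoder(speech_feats)      # (B, T_s, d_model)
        text_mem = self.text_encoder(text_ids)              # (B, T_t, d_model)
        memory = torch.cat([speech_mem, text_mem], dim=1)   # (B, T_s+T_t, d_model)

        tgt = self.token_embed(decoder_input_ids)           # (B, T_d, d_model)
        hidden = self.decoder(tgt, memory)                  # (B, T_d, d_model)
        # Predict an utterance-level intent from the first decoder position.
        return self.intent_head(hidden[:, 0, :])
```

In the system the abstract describes, both branches would be initialized from pre-trained models (a well-optimized E2E ASR encoder and a pre-trained language model encoder), and the unified model would then be fine-tuned on paired SLU data.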

Cited by 16 publications (6 citation statements)
References 28 publications
“…Again, fine-tuning time and cost dominate the SLU model cost, but fine-tuning is intended to be done once and for all, producing resources that will be made available so that the process does not have to be repeated. Additionally, while we fine-tune a model of 315M parameters on FSC data only, that is, 14.7 hours of speech, a state-of-the-art model such as [31] pre-trains ASR and BERT-base models, roughly 287M parameters, on 75k hours of speech, then uses these models as components in the final SLU system, which is also fine-tuned on the FSC data.…”
Section: Results on English FSC
confidence: 99%
“…Kim et al. (2021) learn multi-modal alignment with two cross-modal pre-training tasks, masked language modeling and conditioned language modeling. Qian et al. (2021) unify a pre-trained ASR encoder for speech and a pre-trained language model encoder for text into a transformer decoder. Sato et al. (2022) introduce an adaptation branch to embed acoustic and linguistic information in the same latent space.…”
Section: Related Work
confidence: 99%
“…This model is evaluated for its robustness against ASR errors and its ability to extract semantic meaning from the input sequence. Qian et al. proposed to integrate an end-to-end ASR encoder and a pre-trained language model encoder into a transformer decoder for the SLU task [26].…”
Section: ASR-SLU-Based Intent Classification
confidence: 99%