ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053281
Leveraging Unpaired Text Data for Training End-To-End Speech-to-Intent Systems

Abstract: Training an end-to-end (E2E) neural network speech-to-intent (S2I) system that directly extracts intents from speech requires large amounts of intent-labeled speech data, which is time consuming and expensive to collect. Initializing the S2I model with an ASR model trained on copious speech data can alleviate data sparsity. In this paper, we attempt to leverage NLU text resources. We implemented a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system. We perfor…
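To make the abstract's setup concrete, the sketch below shows one plausible shape for a CTC-based S2I model: an acoustic encoder shared by a CTC head over transcript tokens (which permits ASR-style pretraining on paired speech and text) and a pooled intent-classification head. This is a minimal illustration assuming PyTorch; the module names, dimensions, and mean-pooling choice are placeholders, not details taken from the paper.

```python
import torch
import torch.nn as nn

class SpeechToIntent(nn.Module):
    """Hypothetical CTC-based S2I model: an acoustic encoder shared by a
    CTC head (for ASR-style pretraining on paired speech/text) and a
    pooled intent head (for fine-tuning on intent-labeled speech)."""

    def __init__(self, n_mels=80, hidden=256, vocab_size=1000, n_intents=30):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=4,
                               batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank
        self.intent_head = nn.Linear(2 * hidden, n_intents)

    def forward(self, feats):
        # feats: (batch, time, n_mels) log-mel filterbank features
        enc, _ = self.encoder(feats)                  # (batch, time, 2*hidden)
        ctc_log_probs = self.ctc_head(enc).log_softmax(-1)
        pooled = enc.mean(dim=1)                      # utterance-level embedding
        return ctc_log_probs, self.intent_head(pooled)

model = SpeechToIntent()
ctc_log_probs, intent_logits = model(torch.randn(8, 200, 80))
# Pretraining: nn.CTCLoss on ctc_log_probs against transcripts.
# Fine-tuning: nn.CrossEntropyLoss on intent_logits against intent labels.
```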

Cited by 48 publications (43 citation statements). References 23 publications.
“…Speech to semantics mapping was also defined as a sequence-to-sequence problem in [5,20], but it lacks a mechanism to leverage pre-trained models from either speech or language to further improve the performance of E2E SLU. Speech synthesis is thus explored to generate large training data from multiple artificial speakers to cover the shortage of paired data in E2E SLU [13,20]. The most similar work to ours has been presented in [13,21], where it initializes the speech-to-intent (S2I) model with an ASR model and improves it by leveraging BERT and text-to-speech augmented S2I data.…”
Section: Related Work
Mentioning, confidence: 99%
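The speech-synthesis augmentation mentioned in the statement above reduces to a simple loop: synthesize each intent-labeled sentence with several artificial voices and add the resulting pairs to the S2I training set. The `synthesize` callable below is a hypothetical stand-in for any TTS engine, not a real library API.

```python
# Hypothetical TTS augmentation: synthesize intent-labeled text with several
# artificial voices to create synthetic (audio, intent) training pairs.
# `synthesize` is a placeholder callable, not a real library API.
def augment_with_tts(labeled_text, voices, synthesize):
    augmented = []
    for text, intent in labeled_text:
        for voice in voices:
            audio = synthesize(text, voice=voice)  # waveform or acoustic features
            augmented.append((audio, intent))
    return augmented
```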
“…Speech synthesis is thus explored to generate large training data from multiple artificial speakers to cover the shortage of paired data in E2E SLU [13,20]. The most similar work to ours has been presented in [13,21], where it initializes the speech-to-intent (S2I) model with an ASR model and improves it by leveraging BERT and text-to-speech augmented S2I data. The significant difference between ours and theirs is that they employ sentence-level embedding from both speech and language to jointly classify intent, while we use an attention-based autoregressive generation model jointly trained by using frame-level speech embedding and token level text embedding to generate multiple intents and slots/values given an utterance input.…”
Section: Related Work
Mentioning, confidence: 99%
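To make the contrast in the statement above concrete, the sentence-level joint classification scheme it attributes to [13,21] might be sketched as follows: one pooled embedding per utterance from speech and one from a BERT-like text encoder, concatenated and mapped to a single intent decision. Every name here is a placeholder and the fusion layer is an assumption, not the cited implementation.

```python
import torch
import torch.nn as nn

class JointIntentClassifier(nn.Module):
    """Hypothetical sentence-level fusion: one pooled speech embedding and
    one text embedding (e.g., a BERT [CLS] vector) are concatenated and
    mapped to a single intent label."""

    def __init__(self, speech_dim=512, text_dim=768, n_intents=30):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(speech_dim + text_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_intents),
        )

    def forward(self, speech_emb, text_emb):
        # speech_emb: (batch, speech_dim); text_emb: (batch, text_dim)
        return self.fuse(torch.cat([speech_emb, text_emb], dim=-1))
```

The autoregressive alternative described in the statement would instead run a decoder with attention over frame-level speech encodings, emitting intent and slot/value tokens one at a time, so multiple intents per utterance fall out naturally.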
“…[17] generate synthetic data from text to augment speech training corpus. Another direction is to leverage text data directly during training through multitask learning [18,19,20]. [19] use a common representation space to learn correspondences between different modalities for spoken language understanding.…”
Section: Introduction
Mentioning, confidence: 99%
“…Another direction is to leverage text data directly during training through multitask learning [18,19,20]. [19] use a common representation space to learn correspondences between different modalities for spoken language understanding. [20] propose multi-modal data augmentation to jointly train text and speech for ASR.…”
Section: Introduction
Mentioning, confidence: 99%
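As a rough illustration of the shared-representation-space idea attributed to [19], one can tie speech and text encoders to a common embedding space with an alignment loss, so that intent supervision learned from text also transfers to speech. The MSE choice and the 0.1 weighting below are arbitrary placeholders, a minimal sketch rather than the cited method.

```python
import torch.nn.functional as F

# Hypothetical multitask objective tying speech and text to one embedding
# space: an alignment term pulls paired embeddings together, so an intent
# classifier trained on (unpaired) text transfers to speech.
def multitask_loss(speech_emb, text_emb, intent_logits, intent_labels):
    align = F.mse_loss(speech_emb, text_emb)             # modality alignment
    cls = F.cross_entropy(intent_logits, intent_labels)  # intent supervision
    return cls + 0.1 * align                             # weight is a placeholder
```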
“…Recently there has been a significant effort to build end-to-end (E2E) models for spoken language understanding [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. Instead of using an ASR system in tandem with a text-based natural language understanding system [2,16,17], these systems directly process speech to produce spoken language understanding (SLU) entity or intent label targets.…”
Section: Introduction
Mentioning, confidence: 99%