ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053281
Leveraging Unpaired Text Data for Training End-To-End Speech-to-Intent Systems

Abstract: Training an end-to-end (E2E) neural network speech-to-intent (S2I) system that directly extracts intents from speech requires large amounts of intent-labeled speech data, which is time consuming and expensive to collect. Initializing the S2I model with an ASR model trained on copious speech data can alleviate data sparsity. In this paper, we attempt to leverage NLU text resources. We implemented a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system. We perfor…
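To make the abstract's setup concrete, the sketch below shows one plausible shape for a CTC-based S2I model: an acoustic encoder shared by a CTC head over transcript tokens (which permits ASR-style pretraining on paired speech and text) and a pooled intent-classification head. This is a minimal illustration assuming PyTorch; the module names, dimensions, and mean-pooling choice are placeholders, not details taken from the paper.

```python
import torch
import torch.nn as nn

class SpeechToIntent(nn.Module):
    """Hypothetical CTC-based S2I model: an acoustic encoder shared by a
    CTC head (for ASR-style pretraining on paired speech/text) and a
    pooled intent head (for fine-tuning on intent-labeled speech)."""

    def __init__(self, n_mels=80, hidden=256, vocab_size=1000, n_intents=30):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=4,
                               batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank
        self.intent_head = nn.Linear(2 * hidden, n_intents)

    def forward(self, feats):
        # feats: (batch, time, n_mels) log-mel filterbank features
        enc, _ = self.encoder(feats)                  # (batch, time, 2*hidden)
        ctc_log_probs = self.ctc_head(enc).log_softmax(-1)
        pooled = enc.mean(dim=1)                      # utterance-level embedding
        return ctc_log_probs, self.intent_head(pooled)

model = SpeechToIntent()
ctc_log_probs, intent_logits = model(torch.randn(8, 200, 80))
# Pretraining: nn.CTCLoss on ctc_log_probs against transcripts.
# Fine-tuning: nn.CrossEntropyLoss on intent_logits against intent labels.
```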

Cited by 48 publications (43 citation statements). References 23 publications.
“…Speech to semantics mapping was also defined as a sequence-to-sequence problem in [5,20], but it lacks a mechanism to leverage pre-trained models from either speech or language to further improve the performance of E2E SLU. Speech synthesis is thus explored to generate large training data from multiple artificial speakers to cover the shortage of paired data in E2E SLU [13,20]. The most similar work to ours has been presented in [13,21], where it initializes the speech-to-intent (S2I) model with an ASR model and improves it by leveraging BERT and text-to-speech augmented S2I data.…”
Section: Related Work
Mentioning, confidence: 99%
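The speech-synthesis augmentation mentioned in the statement above reduces to a simple loop: synthesize each intent-labeled sentence with several artificial voices and add the resulting pairs to the S2I training set. The `synthesize` callable below is a hypothetical stand-in for any TTS engine, not a real library API.

```python
# Hypothetical TTS augmentation: synthesize intent-labeled text with several
# artificial voices to create synthetic (audio, intent) training pairs.
# `synthesize` is a placeholder callable, not a real library API.
def augment_with_tts(labeled_text, voices, synthesize):
    augmented = []
    for text, intent in labeled_text:
        for voice in voices:
            audio = synthesize(text, voice=voice)  # waveform or acoustic features
            augmented.append((audio, intent))
    return augmented
```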
“…Speech synthesis is thus explored to generate large training data from multiple artificial speakers to cover the shortage of paired data in E2E SLU [13,20]. The most similar work to ours has been presented in [13,21], where it initializes the speech-to-intent (S2I) model with an ASR model and improves it by leveraging BERT and text-to-speech augmented S2I data. The significant difference between ours and theirs is that they employ sentence-level embedding from both speech and language to jointly classify intent, while we use an attention-based autoregressive generation model jointly trained by using frame-level speech embedding and token level text embedding to generate multiple intents and slots/values given an utterance input.…”
Section: Related Work
Mentioning, confidence: 99%
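To make the contrast in the statement above concrete, the sentence-level joint classification scheme it attributes to [13,21] might be sketched as follows: one pooled embedding per utterance from speech and one from a BERT-like text encoder, concatenated and mapped to a single intent decision. Every name here is a placeholder and the fusion layer is an assumption, not the cited implementation.

```python
import torch
import torch.nn as nn

class JointIntentClassifier(nn.Module):
    """Hypothetical sentence-level fusion: one pooled speech embedding and
    one text embedding (e.g., a BERT [CLS] vector) are concatenated and
    mapped to a single intent label."""

    def __init__(self, speech_dim=512, text_dim=768, n_intents=30):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(speech_dim + text_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_intents),
        )

    def forward(self, speech_emb, text_emb):
        # speech_emb: (batch, speech_dim); text_emb: (batch, text_dim)
        return self.fuse(torch.cat([speech_emb, text_emb], dim=-1))
```

The autoregressive alternative described in the statement would instead run a decoder with attention over frame-level speech encodings, emitting intent and slot/value tokens one at a time, so multiple intents per utterance fall out naturally.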
“…[17] generate synthetic data from text to augment speech training corpus. Another direction is to leverage text data directly during training through multitask learning [18,19,20]. [19] use a common representation space to learn correspondences between different modalities for spoken language understanding.…”
Section: Introduction
Mentioning, confidence: 99%
“…Another direction is to leverage text data directly during training through multitask learning [18,19,20]. [19] use a common representation space to learn correspondences between different modalities for spoken language understanding. [20] propose multi-modal data augmentation to jointly train text and speech for ASR.…”
Section: Introduction
Mentioning, confidence: 99%
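As a rough illustration of the shared-representation-space idea attributed to [19], one can tie speech and text encoders to a common embedding space with an alignment loss, so that intent supervision learned from text also transfers to speech. The MSE choice and the 0.1 weighting below are arbitrary placeholders, a minimal sketch rather than the cited method.

```python
import torch.nn.functional as F

# Hypothetical multitask objective tying speech and text to one embedding
# space: an alignment term pulls paired embeddings together, so an intent
# classifier trained on (unpaired) text transfers to speech.
def multitask_loss(speech_emb, text_emb, intent_logits, intent_labels):
    align = F.mse_loss(speech_emb, text_emb)             # modality alignment
    cls = F.cross_entropy(intent_logits, intent_labels)  # intent supervision
    return cls + 0.1 * align                             # weight is a placeholder
```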
“…Recently there has been a significant effort to build end-to-end (E2E) models for spoken language understanding [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. Instead of using an ASR system in tandem with a text-based natural language understanding system [2,16,17], these systems directly process speech to produce spoken language understanding (SLU) entity or intent label targets.…”
Section: Introduction
Mentioning, confidence: 99%