2020
DOI: 10.48550/arxiv.2011.09044
Preprint

Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding

Abstract: End-to-end (E2E) spoken language understanding (SLU) systems can infer the semantics of a spoken utterance directly from an audio signal. However, training an E2E system remains a challenge, largely due to the scarcity of paired audio-semantics data. In this paper, we treat an E2E system as a multi-modal model, with audio and text functioning as its two modalities, and use a cross-modal latent space (CMLS) architecture, where a shared latent space is learned between the 'acoustic' and 'text' embeddings. We pro…
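The abstract describes the CMLS idea only at a high level. As a rough illustrative sketch (not the authors' released code; the encoder stand-ins, dimensions, and the cosine alignment loss below are all assumptions), a shared latent space between acoustic and text embeddings might be set up like this in PyTorch:

```python
# Illustrative sketch only -- not the paper's implementation. Encoder
# architectures, dimensions, and the loss choice are assumed for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMLSSketch(nn.Module):
    """Projects audio and text features into one shared latent space."""
    def __init__(self, audio_dim=80, text_dim=768, latent_dim=256):
        super().__init__()
        # Stand-ins for the real acoustic and text encoders.
        self.audio_proj = nn.Sequential(
            nn.Linear(audio_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim))
        self.text_proj = nn.Sequential(
            nn.Linear(text_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim))

    def forward(self, audio_feats, text_feats):
        # L2-normalize so distances in the shared space are comparable.
        z_audio = F.normalize(self.audio_proj(audio_feats), dim=-1)
        z_text = F.normalize(self.text_proj(text_feats), dim=-1)
        return z_audio, z_text

def alignment_loss(z_audio, z_text):
    # Pull paired acoustic/text embeddings together in the shared space.
    return (1 - F.cosine_similarity(z_audio, z_text, dim=-1)).mean()
```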

Cited by 2 publications (8 citation statements) · References 21 publications
“…We obtain significant improvements in performance over the baseline model on the Snips [18] and Fluent Speech Commands (FSC) [10] datasets. To welcome researchers to improve upon our work, similar to [10, 14], we are releasing our codebase.…”
Section: Introduction (mentioning, confidence: 98%)
“…Alternatively, the authors of [13] successfully co-train a text-to-intent (T2I) model and a speech-to-intent (S2I) model to closely align the acoustic embeddings with BERT-based text embeddings. Further improving on this cross-modal approach, the authors of [14] employ the triplet loss function to learn a robust cross-modal latent space. Training the acoustic and text embeddings in the shared latent space makes it easier to combine separate audio and text data.…”
Section: Introduction (mentioning, confidence: 99%)
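As a minimal, hedged sketch of the triplet objective this statement describes (the margin value, batch size, and negative-sampling scheme below are assumptions, not details taken from [14]): the acoustic embedding serves as the anchor, the matching text embedding as the positive, and a mismatched text embedding as the negative.

```python
# Hedged sketch of a cross-modal triplet loss, not the paper's exact objective.
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2, p=2)  # margin is an assumed value

anchor = torch.randn(32, 256)        # acoustic embeddings (batch, latent_dim)
positive = torch.randn(32, 256)      # text embeddings of the same utterances
negative = positive.roll(1, dims=0)  # text embeddings of other utterances

loss = triplet(anchor, positive, negative)  # pushes mismatched pairs apart
```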
“…SLU systems have traditionally been a cascade of an automatic speech recognition (ASR) system converting speech into text followed by a natural language understanding (NLU) system that interprets the meaning of the text [1][2][3][4]. In contrast, an end-to-end (E2E) SLU system [5][6][7][8][9][10][11][12][13][14] processes speech input directly into meaning without going through an intermediate text transcript.…”
Section: Introduction (mentioning, confidence: 99%)