Interspeech 2020
DOI: 10.21437/interspeech.2020-1246

Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation

Abstract: Speech is one of the most effective means of communication and is full of information that helps convey the utterer's thoughts. However, mainly due to the cumbersome processing of acoustic features, the phoneme or word posterior probability has frequently been discarded in natural language understanding. Thus, some recent spoken language understanding (SLU) modules have utilized an end-to-end structure that preserves the uncertainty information. This further reduces the propagation of speech recogniti…
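The abstract argues for keeping ASR uncertainty (phoneme or word posterior probabilities) instead of collapsing it to a hard transcript before language understanding. As a rough illustration only, and not the paper's actual architecture (which is not shown on this page), the toy classifier below consumes per-frame phoneme posterior distributions directly; the phoneme count, intent count, and layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class PosteriorIntentClassifier(nn.Module):
    """Toy SLU head that keeps ASR uncertainty by reading per-frame
    phoneme posterior distributions instead of a hard transcript."""
    def __init__(self, num_phonemes=70, hidden=256, num_intents=31):
        super().__init__()
        self.encoder = nn.GRU(num_phonemes, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_intents)

    def forward(self, posteriors):            # (batch, frames, num_phonemes)
        _, last_hidden = self.encoder(posteriors)
        return self.head(last_hidden[-1])     # (batch, num_intents) logits

# Dummy input: a softmax over the phoneme inventory at every frame.
posteriors = torch.softmax(torch.randn(2, 100, 70), dim=-1)
intent_logits = PosteriorIntentClassifier()(posteriors)
```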

Cited by 17 publications (8 citation statements) · References 29 publications

“…Data Shortage Scenario. To examine the robustness of model performance to varying training data size, we test our model with a small amount of data, as presented in [2,5]. In Table 1, we observe a comparatively marginal performance degradation in ST-BERT for both data shortage scenarios.…”
Section: Results · Citation type: mentioning · Confidence: 99%
“…Chung et al [17] learn audio segment representations and word representations individually and align their spaces via adversarial training. Other works [5,6] match sequence-level representations of the two modalities using knowledge distillation [18] from a text encoder to a speech encoder. Speech-BERT [19] jointly trains multi-modal representations.…”
Section: Cross-modal Representation Learning · Citation type: mentioning · Confidence: 99%
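The works cited as [5,6] above are described as matching sequence-level representations across modalities via knowledge distillation from a text encoder to a speech encoder. Below is a minimal sketch of that idea, assuming pooled utterance- and sentence-level embeddings of equal dimensionality and an MSE matching loss; the actual pooling and objective in those papers may differ.

```python
import torch
import torch.nn.functional as F

def sequence_level_kd_loss(speech_repr, text_repr):
    """speech_repr, text_repr: (batch, dim) pooled sequence embeddings.
    The text side is detached so gradients only update the speech encoder."""
    return F.mse_loss(speech_repr, text_repr.detach())

speech_repr = torch.randn(4, 768, requires_grad=True)  # student (speech encoder) output
text_repr = torch.randn(4, 768)                        # frozen teacher (text encoder) output
loss = sequence_level_kd_loss(speech_repr, text_repr)
loss.backward()
```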
“…Several works use a cross-modal distillation approach for SLU [13,14] to exploit textual knowledge. Cho et al [13] use knowledge distillation from a fine-tuned text BERT to an SLU model by pulling the two models' predicted intent-classification logits close to each other during fine-tuning. Denisov and Vu [14] match an utterance embedding and a sentence embedding of ASR pairs using knowledge distillation as a pre-training step.…”
Section: Knowledge Distillation for SLU · Citation type: mentioning · Confidence: 99%
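The statement above describes [13] as matching predicted intent logits between a fine-tuned text BERT teacher and an SLU student. A standard soft-target distillation loss in that spirit is sketched below; the temperature, loss weighting, and exact objective are illustrative assumptions rather than values taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft KL term on temperature-scaled logits with the usual
    hard-label cross-entropy on the intent labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example with random logits over 31 intent classes.
student = torch.randn(4, 31, requires_grad=True)
teacher = torch.randn(4, 31)
labels = torch.randint(0, 31, (4,))
loss = distillation_loss(student, teacher, labels)
```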
“…Test accuracy on FSC almost reaches 100%, implying that there is little room to improve and to evaluate a newly proposed method's effectiveness. Following [6,13], we simulate a data shortage scenario using only 10% of the speech-text pairs in training. We randomly divide the FSC dataset into ten parts and report the average accuracy over them.…”
Section: Dataset · Citation type: mentioning · Confidence: 99%
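The 10% data-shortage protocol quoted above (split FSC into ten parts, train on each part alone, average the accuracies) can be sketched as follows; train_and_evaluate is a hypothetical helper standing in for the citing paper's full training and test loop, and the seed is arbitrary.

```python
import random

def data_shortage_accuracy(pairs, train_and_evaluate, num_parts=10, seed=0):
    """Shuffle the speech-text pairs, split them into `num_parts` disjoint
    subsets, train/evaluate on each subset alone, and average the accuracies."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    parts = [shuffled[i::num_parts] for i in range(num_parts)]
    accuracies = [train_and_evaluate(part) for part in parts]
    return sum(accuracies) / len(accuracies)
```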