2020
DOI: 10.48550/arxiv.2005.08213
Preprint

Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation

Cited by 3 publications (8 citation statements); References 0 publications

“…Lastly, while transcriptions are provided in ATIS and SNIPS, they are not normalized for ASR. Text normalization is applied with open-source software 10. For ATIS, utterances are ignored if they contain words with multiple slot labels [59].…”
Section: Methods
Citation type: mentioning
Confidence: 99%
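As an illustration of the preprocessing quoted above, the sketch below applies ASR-style text normalization and drops utterances in which any token carries more than one slot label. The function names, the regex-based normalizer, and the per-token label-set representation are assumptions made for this sketch; they are not the open-source tool or code referenced in the excerpt.

```python
# Illustrative preprocessing sketch (not the cited authors' code): ASR-style text
# normalization plus filtering of ATIS utterances whose tokens carry >1 slot label.
import re

def normalize_for_asr(text: str) -> str:
    """Lowercase and strip punctuation so transcripts resemble raw ASR output."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)   # drop punctuation/special characters
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

def keep_utterance(token_label_sets) -> bool:
    """token_label_sets: one set of slot labels per token; drop the utterance
    if any token is annotated with more than one slot label."""
    return all(len(labels) <= 1 for labels in token_label_sets)

# Example: an utterance with a doubly-labelled token would be discarded.
assert keep_utterance([{"O"}, {"O"}, {"B-city", "B-airport"}]) is False
```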
“…Although there is much work on E2E speech enhancement [57], we found that merely augmenting the training data with a diverse set of environmental noises works well. We followed the noise augmentation protocol described in [50], where for each training sample, five noise files are randomly sampled and added to the clean file with SNR levels of [0, 10, 20, 30, 40] dB, resulting in a five-fold data augmentation. Table 3 shows our proposed models trained with noise augmentation.…”
Section: Environmental Noise Augmentation
Citation type: mentioning
Confidence: 99%
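A minimal sketch of the noise-augmentation protocol quoted above: for each clean training sample, five noise files are drawn at random and mixed in at SNRs of 0, 10, 20, 30, and 40 dB, giving a five-fold augmentation. The helper names and the power-based SNR mixing are assumptions for this sketch, not implementation details taken from the cited papers.

```python
# Hedged sketch of five-fold noise augmentation at fixed SNR levels.
import random
import numpy as np

SNRS_DB = [0, 10, 20, 30, 40]

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    noise = np.resize(noise, clean.shape)                  # loop/trim noise to clean length
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return clean + scale * noise

def augment(clean: np.ndarray, noise_pool: list) -> list:
    """Return five noisy copies of `clean`, one per SNR level in SNRS_DB."""
    noises = random.sample(noise_pool, k=len(SNRS_DB))     # five randomly sampled noise files
    return [mix_at_snr(clean, n, snr) for n, snr in zip(noises, SNRS_DB)]
```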
“…Several works use a cross-modal distillation approach for SLU [13,14] to exploit textual knowledge. Cho et al [13] use knowledge distillation from a fine-tuned text BERT to an SLU model by making the predicted logits for intent classification close to each other during fine-tuning. Denisov and Vu [14] match an utterance embedding and a sentence embedding of ASR pairs using knowledge distillation as a pre-training step.…”
Section: Knowledge Distillation for SLU
Citation type: mentioning
Confidence: 99%
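The logit-matching distillation described in the excerpt can be sketched as a standard knowledge-distillation loss: the speech model's intent logits are pushed toward those of a fine-tuned text BERT teacher. The temperature, the loss weighting, and the specific KL formulation are illustrative assumptions, not details taken from [13] or [14].

```python
# Hedged sketch of cross-modal logit distillation for intent classification.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend hard-label cross-entropy with a KL term matching softened teacher logits."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                                    # standard KD temperature scaling
    return alpha * hard + (1.0 - alpha) * soft
```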