ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682775
Knowledge Distillation Using Output Errors for Self-attention End-to-end Models

Cited by 29 publications (17 citation statements)
References 10 publications
“…The multi-head transformer-based sequence-to-sequence model [18,2,19] was trained with the TensorFlow framework. For the ASR model, the encoder consists of 10 self-attention blocks with 768 nodes and a filter size of 3,072.…”
Section: Methods (mentioning)
confidence: 99%
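For readers who want to see the encoder shape quoted above in code, here is a minimal sketch assuming TensorFlow 2 / Keras. Only the layer count (10), model width (768), and feed-forward size (3,072) come from the excerpt; the head count of 8 and all names are assumptions.

```python
import tensorflow as tf

# Hyper-parameters from the excerpt (layers, width, feed-forward size);
# NUM_HEADS is an assumption, not stated in the cited text.
NUM_LAYERS, D_MODEL, FFN_DIM, NUM_HEADS = 10, 768, 3072, 8

class EncoderBlock(tf.keras.layers.Layer):
    """One self-attention encoder block: attention + position-wise FFN."""
    def __init__(self):
        super().__init__()
        self.attn = tf.keras.layers.MultiHeadAttention(
            num_heads=NUM_HEADS, key_dim=D_MODEL // NUM_HEADS)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(FFN_DIM, activation="relu"),
            tf.keras.layers.Dense(D_MODEL),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, x):
        # Residual connection around self-attention, then around the FFN.
        x = self.norm1(x + self.attn(x, x))
        return self.norm2(x + self.ffn(x))

# Stack of 10 blocks, as described in the citing paper's setup.
encoder = tf.keras.Sequential([EncoderBlock() for _ in range(NUM_LAYERS)])
```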
“…The comparison of our proposed method with state-of-the-art methods on TIMIT and LibriSpeech is shown in Table 7. Takashima et al. [37] investigated sequence-level knowledge distillation, and Kim et al. [41] proposed adding an exponential weight coefficient to the sequence-level knowledge distillation method to balance the recognition quality of the teacher model. We chose DeepSpeech2 as the teacher model, selected two layers of LSTM as the student model, and evaluated them with Sequence-level KD, Essence KD, and ERR-KD.…”
Section: Comparison With State-of-the-art Methods (mentioning)
confidence: 99%
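As a rough illustration of the exponentially weighted sequence-level distillation these excerpts refer to, the sketch below down-weights teacher hypotheses by exp(-error) before forming the distillation loss. The exact weighting form, the scale parameter, and all names are assumptions for illustration, not taken from the cited papers.

```python
import tensorflow as tf

def weighted_sequence_kd_loss(student_logprobs, teacher_errors, scale=1.0):
    """Sketch of sequence-level KD with an exponential quality weight.

    student_logprobs: [batch] log-probability the student assigns to each
                      teacher-decoded hypothesis sequence.
    teacher_errors:   [batch] error measure of each teacher hypothesis
                      (e.g. per-utterance WER against the reference).
    """
    # Hypotheses the teacher got badly wrong contribute less to the loss.
    weights = tf.exp(-scale * teacher_errors)
    return -tf.reduce_mean(weights * student_logprobs)
```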
“…Kim et al. [41] added an exponential weight to the sequence-level knowledge distillation method; the weight reflects the quality of the teacher model's output and enters the weighting scheme used to minimize the knowledge distillation loss function. Meng et al. [42] proposed a conditional teacher-student framework, in which the student model selectively chooses to learn from either the ground-truth labels or the outputs of the teacher model.…”
Section: Related Work (mentioning)
confidence: 99%
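A minimal sketch of the conditional teacher-student idea described above, assuming a per-token criterion: where the teacher's prediction matches the ground truth, the student distills from the teacher's distribution; otherwise it falls back to the hard label. Function and variable names are illustrative only.

```python
import tensorflow as tf

def conditional_ts_loss(student_logits, teacher_logits, labels):
    """Per-token conditional teacher-student loss (illustrative sketch)."""
    teacher_pred = tf.argmax(teacher_logits, axis=-1, output_type=labels.dtype)
    teacher_correct = tf.cast(tf.equal(teacher_pred, labels), tf.float32)

    # Soft loss: KL divergence between teacher and student distributions.
    kd = tf.keras.losses.kl_divergence(
        tf.nn.softmax(teacher_logits), tf.nn.softmax(student_logits))
    # Hard loss: cross-entropy against the ground-truth labels.
    ce = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True)

    # Learn from the teacher only where it is right; otherwise from the labels.
    return tf.reduce_mean(teacher_correct * kd + (1.0 - teacher_correct) * ce)
```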
“…We used 4,096 word-pieces as the output token units. For E2E model training, we used the same hyper-parameters as in [3]. All experiments used the same input feature processing as [24].…”
Section: Methods (mentioning)
confidence: 99%
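The excerpt above fixes the output vocabulary at 4,096 word-pieces but does not name a toolkit. As one possible way to build such a vocabulary, the sketch below uses SentencePiece; the toolkit choice, file paths, and model type are assumptions.

```python
import sentencepiece as spm

# Train a 4,096-unit subword model on the training transcripts
# ("transcripts.txt" and the "wp4096" prefix are placeholder names).
spm.SentencePieceTrainer.train(
    input="transcripts.txt",
    model_prefix="wp4096",
    vocab_size=4096,
    model_type="unigram",   # unigram or bpe both yield word-piece-like units
)

sp = spm.SentencePieceProcessor(model_file="wp4096.model")
print(sp.encode("knowledge distillation for end-to-end ASR", out_type=str))
```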
“…This is because the Transformer has an advantage in terms of computation and parallelism over recurrent neural network (RNN) based models. In addition, knowledge distillation has been studied as a way to create parameter-efficient models [3,4]. Shallow fusion of E2E ASR models with external language models (LMs) has also shown further improvements in WER [5,6], because external LMs are able to learn more contextual information from abundant text-only data.…”
Section: Introduction (mentioning)
confidence: 99%
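Shallow fusion, mentioned in the last excerpt, simply interpolates the E2E model's score with an external LM score when ranking beam-search hypotheses. The sketch below shows that score combination in isolation; the interpolation weight, example scores, and function names are illustrative assumptions.

```python
def shallow_fusion_score(asr_logprob, lm_logprob, lm_weight=0.3):
    """Combined score used to rank beam-search hypotheses.

    asr_logprob: log P_ASR(y | x) from the end-to-end model.
    lm_logprob:  log P_LM(y) from the external language model.
    lm_weight:   interpolation weight (lambda), tuned on a dev set.
    """
    return asr_logprob + lm_weight * lm_logprob

# Example: pick the better of two candidate hypotheses (made-up scores).
hyps = [("hello world", -4.2, -6.1), ("hollow world", -4.0, -9.5)]
best = max(hyps, key=lambda h: shallow_fusion_score(h[1], h[2]))
print(best[0])  # "hello world" wins once the LM score is folded in
```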