ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019
DOI: 10.1109/icassp.2019.8683602

Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model

Abstract: In this paper we proposed a novel Adversarial Training (AT) approach for end-to-end speech recognition using a Criticizing Language Model (CLM). In this way the CLM and the automatic speech recognition (ASR) model can challenge and learn from each other iteratively to improve the performance. Since the CLM only takes the text as input, huge quantities of unpaired text data can be utilized in this approach within end-to-end training. Moreover, AT can be applied to any end-to-end ASR model using any deep-learnin…

Cited by 37 publications (40 citation statements)
References 23 publications
“…The inputs are a set of audio files with their corresponding transcriptions as labels, while the outputs are the transcribed sequential texts. To simulate most of the current ASR models in the real world, we created a state-of-the-art hybrid ASR model [35] using the PyTorch-Kaldi Speech Recognition Toolkit [29] and an end-to-end ASR model using the Pytorch implementation [14]. In the preprocessing step, fMLLR features were used to train the ASR model with 24 training epochs.…”
Section: Methods (mentioning)
confidence: 99%
“…We experimentally tuned the batch size, learning rate and optimization function to gain a model with better ASR performance. To mimic the ASR model in the wild, we tuned the parameters until the training accuracy exceeded 80%, similar to the results shown in [14,27]. Additionally, to better contextualize our audit results, we report the overfitting level of the ASR models, defined as the difference between the predictions' Word Error Rate (WER) on the training set and the testing set ($\text{Overfitting} = \text{WER}_{\text{train}} - \text{WER}_{\text{test}}$).…”
Section: Methods (mentioning)
confidence: 99%
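
The overfitting level defined in the statement above can be illustrated with a minimal sketch. It assumes WER is computed at corpus level as word-level edit distance divided by the number of reference words; the function names (`word_edit_distance`, `corpus_wer`, `overfitting_level`) and the example transcripts are hypothetical and are not taken from the citing paper.

```python
from typing import List, Sequence, Tuple


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) pairs: total edits / total reference words."""
    edits = 0
    words = 0
    for ref, hyp in pairs:
        ref_words = ref.split()
        edits += word_edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("reference transcripts contain no words")
    return edits / words


def overfitting_level(train_pairs: List[Tuple[str, str]],
                      test_pairs: List[Tuple[str, str]]) -> float:
    """Overfitting = WER_train - WER_test, following the quoted definition."""
    return corpus_wer(train_pairs) - corpus_wer(test_pairs)


# Hypothetical (reference, hypothesis) transcripts, for illustration only.
train_pairs = [("the cat sat", "the cat sat")]
test_pairs = [("the cat sat down", "the cat sit")]
print(overfitting_level(train_pairs, test_pairs))
```

With this sign convention, a model that transcribes its training data more accurately than unseen data yields a negative value, since its training WER is the smaller of the two.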
“…In [47], the authors developed a new network architecture for the discriminator to evaluate the video captions based on visual relevance, language fluency, and coherence, while in [26], the authors employed a deep convolutional generative adversarial network for human activity recognition. For speech recognition, in [48], the authors employed a deep speech recognition network trained jointly with a discriminative language model that improves ASR performance. This offers a direction for better utilization of additional text data without the need for a separately trained language model.…”
Section: Related Work (mentioning)
confidence: 99%
“…The problem of closing the domain gap between ASR output and text input to MT has been addressed already in the framework of Statistical Machine Translation (SMT), by training SMT systems on automatically transcribed speech [12], or by augmenting SMT translation models with simulated acoustic confusions [13]. In the area of neural sequence-to-sequence learning, similar approaches have been applied to ASR error correction, either directly by monolingual sequence-to-sequence transformation [14], or by adapting the framework of generative adversarial networks to provide a language-model critic to improve ASR [15]. Our work extends these ideas by using the performance improvement of downstream MT as learning signal in self-training of ASR.…”
Section: Related Work (mentioning)
confidence: 99%