Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model

Liu, Alexander H.; Lee, Hung-yi; Lee, Lin-Shan

doi:10.1109/icassp.2019.8683602

Cited by 38 publications

(41 citation statements)

References 23 publications

“…The inputs are a set of audio files with their corresponding transcriptions as labels, while the outputs are the transcribed sequential texts. To simulate most of the current ASR models in the real world, we created a state-of-the-art hybrid ASR model [35] using the PyTorch-Kaldi Speech Recognition Toolkit [29] and an end-to-end ASR model using the Pytorch implementation [14]. In the preprocessing step, fMLLR features were used to train the ASR model with 24 training epochs.…”

Section: Methods (mentioning)

confidence: 99%

“…We experimentally tuned the batch size, learning rate and optimization function to gain a model with better ASR performance. To mimic the ASR model in the wild, we tuned the parameters until the training accuracy exceeded 80%, similar to the results shown in [14,27]. Additionally, to better contextualize our audit results, we report the overfitting level of the ASR models, defined as the difference between the predictions' Word Error Rate (WER) on the training set and the testing set ($\mathrm{Overfitting} = \mathrm{WER}_{\mathrm{train}} - \mathrm{WER}_{\mathrm{test}}$).…”

Section: Methods (mentioning)

confidence: 99%
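
The overfitting level defined in the statement above can be illustrated with a short, self-contained sketch. It is not the cited authors' code: the function names (`edit_distance`, `wer`, `overfitting_level`) are hypothetical, whitespace tokenization is assumed, and WER is pooled over the whole set (total word errors divided by total reference words), which the quoted text does not specify.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Pooled WER (an assumption): total edits / total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        errors += edit_distance(ref_tokens, hyp.split())
        words += len(ref_tokens)
    return errors / words


def overfitting_level(train_refs, train_hyps, test_refs, test_hyps):
    """Overfitting = WER_train - WER_test, as written in the quoted statement."""
    return wer(train_refs, train_hyps) - wer(test_refs, test_hyps)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    gap = overfitting_level(
        ["the cat sat"], ["the cat sat"],   # training set: perfect transcript
        ["the cat sat"], ["the dog sat"],   # test set: one substitution
    )
    print(gap)
```

With this sign convention, a model that transcribes its training data more accurately than held-out data yields a negative value, so a larger magnitude indicates a wider train/test gap.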

“…End-to-End ASR Systems are attention-based encoder-decoder models [14]. Unlike hybrid ASR systems, the end-to-end system predicts sub-word sequences which are converted directly as word sequences.…”

Section: (B) (mentioning)

confidence: 99%

The Audio Auditor: User-Level Membership Inference in Internet of Things Voice Services

Miao, Xue, Chen, et al. 2020

Proceedings on Privacy Enhancing Technologies

With the rapid development of deep learning techniques, the popularity of voice services implemented on various Internet of Things (IoT) devices is ever increasing. In this paper, we examine user-level membership inference in the problem space of voice services, by designing an audio auditor to verify whether a specific user had unwillingly contributed audio used to train an automatic speech recognition (ASR) model under strict black-box access. With user representation of the input audio data and their corresponding translated text, our trained auditor is effective in user-level audit. We also observe that the auditor trained on specific data can be generalized well regardless of the ASR model architecture. We validate the auditor on ASR models trained with LSTM, RNNs, and GRU algorithms on two state-of-the-art pipelines, the hybrid ASR system and the end-to-end ASR system. Finally, we conduct a real-world trial of our auditor on iPhone Siri, achieving an overall accuracy exceeding 80%. We hope the methodology developed in this paper and findings can inform privacy advocates to overhaul IoT privacy.

“…In [47], the authors developed a new network architecture for the discriminator to evaluate the video captions based on visual relevance, language fluency, and coherence, while in [26], the authors employed a deep convolutional generative adversarial network for human activity recognition. For speech recognition, in [48], the authors employed a deep speech recognition network trained jointly with a discriminative language model that improves ASR performance. This offers a direction for better utilization of additional text data without the need for a separately trained language model.…”

Section: Related Work (mentioning)

confidence: 99%

Continuous Sign Language Recognition through a Context-Aware Generative Adversarial Network

Papastratis, Dimitropoulos, Daras 2021

Sensors

Continuous sign language recognition is a weakly supervised task dealing with the identification of continuous sign gestures from video sequences, without any prior knowledge about the temporal boundaries between consecutive signs. Most of the existing methods focus mainly on the extraction of spatio-temporal visual features without exploiting text or contextual information to further improve the recognition accuracy. Moreover, the ability of deep generative models to effectively model data distribution has not been investigated yet in the field of sign language recognition. To this end, a novel approach for context-aware continuous sign language recognition using a generative adversarial network architecture, named as Sign Language Recognition Generative Adversarial Network (SLRGAN), is introduced. The proposed network architecture consists of a generator that recognizes sign language glosses by extracting spatial and temporal features from video sequences, as well as a discriminator that evaluates the quality of the generator’s predictions by modeling text information at the sentence and gloss levels. The paper also investigates the importance of contextual information on sign language conversations for both Deaf-to-Deaf and Deaf-to-hearing communication. Contextual information, in the form of hidden states extracted from the previous sentence, is fed into the bidirectional long short-term memory module of the generator to improve the recognition accuracy of the network. At the final stage, sign language translation is performed by a transformer network, which converts sign language glosses to natural language text. Our proposed method achieved word error rates of 23.4%, 2.1%, and 2.26% on the RWTH-Phoenix-Weather-2014 and the Chinese Sign Language (CSL) and Greek Sign Language (GSL) Signer Independent (SI) datasets, respectively.

“…The problem of closing the domain gap between ASR output and text input to MT has been addressed already in the framework of Statistical Machine Translation (SMT), by training SMT systems on automatically transcribed speech [12], or by augmenting SMT translation models with simulated acoustic confusions [13]. In the area of neural sequence-to-sequence learning, similar approaches have been applied to ASR error correction, either directly by monolingual sequence-to-sequence transformation [14], or by adapting the framework of generative adversarial networks to provide a language-model critic to improve ASR [15]. Our work extends these ideas by using the performance improvement of downstream MT as learning signal in self-training of ASR.…”

Section: Related Work (mentioning)

confidence: 99%

Cascaded Models with Cyclic Feedback for Direct Speech Translation

Lam, Schamoni, Riezler 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Direct speech translation describes a scenario where only speech inputs and corresponding translations are available. Such data are notoriously limited. We present a technique that allows cascades of automatic speech recognition (ASR) and machine translation (MT) to exploit in-domain direct speech translation data in addition to out-of-domain MT and ASR data. After pre-training MT and ASR, we use a feedback cycle where the downstream performance of the MT system is used as a signal to improve the ASR system by self-training, and the MT component is fine-tuned on multiple ASR outputs, making it more tolerant towards spelling variations. A comparison to end-to-end speech translation using components of identical architecture and the same data shows gains of up to 3.8 BLEU points on LibriVoxDeEn and up to 5.1 BLEU points on CoVoST for German-to-English speech translation.
