Interspeech 2019
DOI: 10.21437/interspeech.2019-2680

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Abstract: We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work…
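The two masking transforms described in the abstract are simple to implement on a log-mel spectrogram. Below is a minimal NumPy sketch; the function name and the defaults for the mask counts and widths (F, T) are illustrative placeholders, not taken from the paper's code, and time warping is omitted for brevity. Masked regions are zeroed here, which matches mean-normalized features where zero is the mean.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, F=27, num_time_masks=2, T=100, rng=None):
    """Apply SpecAugment-style frequency and time masking to a
    spectrogram of shape (num_mel_channels, num_time_steps).
    Masked regions are set to zero; time warping is omitted.
    """
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    num_mel, num_steps = spec.shape

    # Frequency masking: zero out f consecutive mel channels,
    # with f drawn uniformly from [0, F).
    for _ in range(num_freq_masks):
        f = rng.integers(0, F)
        f0 = rng.integers(0, max(1, num_mel - f))
        spec[f0:f0 + f, :] = 0.0

    # Time masking: zero out t consecutive time steps,
    # with t drawn uniformly from [0, T).
    for _ in range(num_time_masks):
        t = rng.integers(0, T)
        t0 = rng.integers(0, max(1, num_steps - t))
        spec[:, t0:t0 + t] = 0.0

    return spec
```

The paper's stronger LibriSpeech policies use mask widths on roughly this order (two frequency masks and two time masks per utterance); treat the defaults above as placeholders in that spirit rather than the published policy.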

Cited by 2,409 publications (1,294 citation statements)
References 34 publications
“…We used an end-to-end encoder-decoder ASR model with additive attention [26]. The model, training strategy, and associated hyperparameters followed LAS-4-1024 in [27].…”
Section: Data Augmentation (mentioning)
confidence: 99%
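This statement names two concrete ingredients: additive attention [26] and the LAS-4-1024 configuration of [27]. A minimal NumPy sketch of additive (Bahdanau-style) attention scoring follows; the weight names W_s, W_h, v and all dimensions are illustrative, not taken from the cited model.

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """Bahdanau-style additive attention.

    decoder_state:  (d_dec,)    current decoder hidden state
    encoder_states: (T, d_enc)  encoder outputs, one per time step
    W_s: (d_att, d_dec), W_h: (d_att, d_enc), v: (d_att,)
    Returns the context vector (d_enc,) and attention weights (T,).
    """
    # score_t = v^T tanh(W_s s + W_h h_t) for each encoder step t
    scores = np.tanh(encoder_states @ W_h.T + decoder_state @ W_s.T) @ v  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over encoder time steps
    context = weights @ encoder_states    # weighted sum of encoder states
    return context, weights
```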
“…Our approach adapts the idea of masked reconstruction to the speech domain. Our approach can also be viewed as extending the data augmentation technique SpecAugment [13], which was shown to be useful for supervised ASR, to unsupervised representation learning. We begin with a spectrogram representation of the input utterance.…”
Section: Pre-training By Masked Reconstruction (mentioning)
confidence: 99%
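The masked-reconstruction objective this statement describes is compact enough to sketch. In the code below, model stands in for whatever network produces the reconstruction, and the choice of an L1 penalty evaluated only on masked positions is my assumption, not a detail given in the quote.

```python
import numpy as np

def masked_reconstruction_loss(spec, mask, model):
    """Pre-training objective: hide part of the spectrogram and
    score the model only on how well it fills the hidden part in.

    spec:  (num_channels, num_steps) input spectrogram
    mask:  boolean array of the same shape, True where input is hidden
    model: callable mapping a masked spectrogram to a reconstruction
           of the same shape (a stand-in for the actual network)
    """
    corrupted = np.where(mask, 0.0, spec)   # zero out the masked region
    reconstruction = model(corrupted)
    # L1 reconstruction error on masked positions only, so the model
    # cannot score well by simply copying the visible input.
    return np.abs(reconstruction - spec)[mask].mean()
```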
“…In particular, the speech signal is continuous while text is discrete; and speech has much finer granularity than text, such that a single word typically spans a large sequence of contiguous frames. To handle these properties of speech, we take our second inspiration from recent work on speech data augmentation [13], which applies masks to the input in both the time and frequency domains. Thus, rather than randomly masking a certain percentage of frames (as in BERT training), we randomly mask some channels across all time steps of the input sequence, as well as contiguous segments in time.…”
Section: Introduction (mentioning)
confidence: 99%
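A sketch of the mask pattern this passage contrasts with BERT-style per-token masking: whole channels hidden across all time steps, plus contiguous spans of frames. The counts and span lengths are illustrative assumptions; a mask built this way could feed the reconstruction loss sketched above.

```python
import numpy as np

def make_speech_mask(num_channels, num_steps, num_masked_channels=8,
                     num_spans=2, max_span=20, rng=None):
    """Build a boolean mask for masked reconstruction on speech:
    True marks positions to hide. Unlike BERT's independent per-token
    masking, whole channels are masked across every time step, and
    time is masked in contiguous spans rather than isolated frames.
    """
    rng = rng or np.random.default_rng()
    mask = np.zeros((num_channels, num_steps), dtype=bool)

    # Mask a few channels across ALL time steps.
    channels = rng.choice(num_channels, size=num_masked_channels, replace=False)
    mask[channels, :] = True

    # Mask a few contiguous segments in time, across all channels.
    for _ in range(num_spans):
        span = rng.integers(1, max_span + 1)
        start = rng.integers(0, max(1, num_steps - span))
        mask[:, start:start + span] = True

    return mask
```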
“…The purpose of Clean-DA is to compare data augmentation to using the noisy examples. We experimented with mixup [14] and SpecAugment (time/frequency masking) [22], and adopted the latter as it gave superior performance. The L q loss is designed to be robust against incorrectly-labelled data, and is the approach taken by the authors of FSDnoisy18k [2].…”
Section: Evaluated Systems (mentioning)
confidence: 99%
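Both techniques weighed in this statement are compact enough to sketch. Below, mixup follows the standard Beta-interpolation recipe of [14], and the L_q loss uses the generalized cross-entropy form (1 - p_y^q) / q; the alpha and q defaults are illustrative, and the exact variant used in [2] may differ from this sketch.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """mixup: train on convex combinations of example pairs.
    x1, x2: feature arrays of equal shape; y1, y2: one-hot label vectors.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2

def lq_loss(probs, labels, q=0.7):
    """Generalized cross-entropy L_q = (1 - p_y^q) / q, robust to
    label noise; interpolates between cross-entropy (q -> 0) and
    mean absolute error (q = 1). probs: (N, C) softmax outputs,
    labels: (N,) integer class indices.
    """
    p_y = probs[np.arange(len(labels)), labels]
    return ((1.0 - p_y ** q) / q).mean()
```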