Interspeech 2019
DOI: 10.21437/interspeech.2019-2680

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Abstract: We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work…
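The two masking transforms described in the abstract are simple to implement on a log-mel spectrogram. Below is a minimal NumPy sketch; the function name and the defaults for the mask counts and widths (F, T) are illustrative placeholders, not taken from the paper's code, and time warping is omitted for brevity. Masked regions are zeroed here, which matches mean-normalized features where zero is the mean.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, F=27, num_time_masks=2, T=100, rng=None):
    """Apply SpecAugment-style frequency and time masking to a
    spectrogram of shape (num_mel_channels, num_time_steps).
    Masked regions are set to zero; time warping is omitted.
    """
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    num_mel, num_steps = spec.shape

    # Frequency masking: zero out f consecutive mel channels,
    # with f drawn uniformly from [0, F).
    for _ in range(num_freq_masks):
        f = rng.integers(0, F)
        f0 = rng.integers(0, max(1, num_mel - f))
        spec[f0:f0 + f, :] = 0.0

    # Time masking: zero out t consecutive time steps,
    # with t drawn uniformly from [0, T).
    for _ in range(num_time_masks):
        t = rng.integers(0, T)
        t0 = rng.integers(0, max(1, num_steps - t))
        spec[:, t0:t0 + t] = 0.0

    return spec
```

The paper's stronger LibriSpeech policies use mask widths on roughly this order (two frequency masks and two time masks per utterance); treat the defaults above as placeholders in that spirit rather than the published policy.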

Cited by 2,409 publications (1,294 citation statements)
References 34 publications
“…We used an end-to-end encoder-decoder ASR model with additive attention [26]. The model, training strategy, and associated hyperparameters followed LAS-4-1024 in [27].…”
Section: Data Augmentation (mentioning)
confidence: 99%
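This statement names two concrete ingredients: additive attention [26] and the LAS-4-1024 configuration of [27]. A minimal NumPy sketch of additive (Bahdanau-style) attention scoring follows; the weight names W_s, W_h, v and all dimensions are illustrative, not taken from the cited model.

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """Bahdanau-style additive attention.

    decoder_state:  (d_dec,)    current decoder hidden state
    encoder_states: (T, d_enc)  encoder outputs, one per time step
    W_s: (d_att, d_dec), W_h: (d_att, d_enc), v: (d_att,)
    Returns the context vector (d_enc,) and attention weights (T,).
    """
    # score_t = v^T tanh(W_s s + W_h h_t) for each encoder step t
    scores = np.tanh(encoder_states @ W_h.T + decoder_state @ W_s.T) @ v  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over encoder time steps
    context = weights @ encoder_states    # weighted sum of encoder states
    return context, weights
```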
“…Our approach adapts the idea of masked reconstruction to the speech domain. Our approach can also be viewed as extending the data augmentation technique SpecAugment [13], which was shown to be useful for supervised ASR, to unsupervised representation learning. We begin with a spectrogram representation of the input utterance.…”
Section: Pre-training By Masked Reconstruction (mentioning)
confidence: 99%
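The masked-reconstruction objective this statement describes is compact enough to sketch. In the code below, model stands in for whatever network produces the reconstruction, and the choice of an L1 penalty evaluated only on masked positions is my assumption, not a detail given in the quote.

```python
import numpy as np

def masked_reconstruction_loss(spec, mask, model):
    """Pre-training objective: hide part of the spectrogram and
    score the model only on how well it fills the hidden part in.

    spec:  (num_channels, num_steps) input spectrogram
    mask:  boolean array of the same shape, True where input is hidden
    model: callable mapping a masked spectrogram to a reconstruction
           of the same shape (a stand-in for the actual network)
    """
    corrupted = np.where(mask, 0.0, spec)   # zero out the masked region
    reconstruction = model(corrupted)
    # L1 reconstruction error on masked positions only, so the model
    # cannot score well by simply copying the visible input.
    return np.abs(reconstruction - spec)[mask].mean()
```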
“…In particular, the speech signal is continuous while text is discrete; and speech has much finer granularity than text, such that a single word typically spans a large sequence of contiguous frames. To handle these properties of speech, we take our second inspiration from recent work on speech data augmentation [13], which applies masks to the input in both the time and frequency domains. Thus, rather than randomly masking a certain percentage of frames (as in BERT training), we randomly mask some channels across all time steps of the input sequence, as well as contiguous segments in time.…”
Section: Introduction (mentioning)
confidence: 99%
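A sketch of the mask pattern this passage contrasts with BERT-style per-token masking: whole channels hidden across all time steps, plus contiguous spans of frames. The counts and span lengths are illustrative assumptions; a mask built this way could feed the reconstruction loss sketched above.

```python
import numpy as np

def make_speech_mask(num_channels, num_steps, num_masked_channels=8,
                     num_spans=2, max_span=20, rng=None):
    """Build a boolean mask for masked reconstruction on speech:
    True marks positions to hide. Unlike BERT's independent per-token
    masking, whole channels are masked across every time step, and
    time is masked in contiguous spans rather than isolated frames.
    """
    rng = rng or np.random.default_rng()
    mask = np.zeros((num_channels, num_steps), dtype=bool)

    # Mask a few channels across ALL time steps.
    channels = rng.choice(num_channels, size=num_masked_channels, replace=False)
    mask[channels, :] = True

    # Mask a few contiguous segments in time, across all channels.
    for _ in range(num_spans):
        span = rng.integers(1, max_span + 1)
        start = rng.integers(0, max(1, num_steps - span))
        mask[:, start:start + span] = True

    return mask
```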
“…The purpose of Clean-DA is to compare data augmentation to using the noisy examples. We experimented with mixup [14] and SpecAugment (time/frequency masking) [22], and adopted the latter as it gave superior performance. The L q loss is designed to be robust against incorrectly-labelled data, and is the approach taken by the authors of FSDnoisy18k [2].…”
Section: Evaluated Systems (mentioning)
confidence: 99%
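Both techniques weighed in this statement are compact enough to sketch. Below, mixup follows the standard Beta-interpolation recipe of [14], and the L_q loss uses the generalized cross-entropy form (1 - p_y^q) / q; the alpha and q defaults are illustrative, and the exact variant used in [2] may differ from this sketch.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """mixup: train on convex combinations of example pairs.
    x1, x2: feature arrays of equal shape; y1, y2: one-hot label vectors.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2

def lq_loss(probs, labels, q=0.7):
    """Generalized cross-entropy L_q = (1 - p_y^q) / q, robust to
    label noise; interpolates between cross-entropy (q -> 0) and
    mean absolute error (q = 1). probs: (N, C) softmax outputs,
    labels: (N,) integer class indices.
    """
    p_y = probs[np.arange(len(labels)), labels]
    return ((1.0 - p_y ** q) / q).mean()
```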