Elastic spectral distortion for low resource speech recognition with deep neural networks

Kanda, Naoyuki; Takeda, Ryo; Obuchi, Yasunari

doi:10.1109/asru.2013.6707748

Cited by 102 publications

(60 citation statements)

References 11 publications


“…Suspicion about simulated data is common in the speech processing community, due for instance to the misleadingly high performance of direction-of-arrival based adaptive beamformers on simulated data compared to real data (Kumatani et al, 2012). Fortunately, this case against simulation does not arise for all techniques: most modern enhancement and ASR techniques can benefit from data augmentation and simulation (Kanda et al, 2013; Brutti and Matassoni, 2016). Few existing datasets involve both real and simulated data.…”

Section: Introduction (mentioning)

confidence: 99%

An analysis of environment, microphone and data simulation mismatches in robust speech recognition

Vincent

Watanabe

Nugraha

et al. 2017

Computer Speech & Language

292

184



“…We conjecture that the spectrogram of a noise segment may be a better domain to apply perturbation. A recent study has found that three perturbations on speech samples in the spectrogram domain improve ASR performance (Kanda et al, 2013). These perturbations were used to expand the speech samples so that more speech patterns are observed by a classifier.…”

Section: Noise Perturbation (mentioning)

confidence: 99%

“…We use the method described in (Kanda et al, 2013) to randomly perturb noise samples. Frequency perturbation takes three steps.…”

Section: Noise Perturbation (mentioning)

confidence: 99%

“…We treat noise expansion as a way to prevent a mask estimator from overfitting the training data. A recent study has shown speech perturbation improves ASR (Kanda et al, 2013). However, our study perturbs noise instead of speech since we focus on separating target speech from highly nonstationary noises where the mismatch among noise segments is the major problem.…”

Section: Introduction (mentioning)

confidence: 99%


Noise perturbation for supervised speech separation

Chen

Wang

2016

Speech Communication


Speech separation can be treated as a mask estimation problem, where interference-dominant portions are masked in a time-frequency representation of noisy speech. In supervised speech separation, a classifier is typically trained on a mixture set of speech and noise. It is important to efficiently utilize limited training data to make the classifier generalize well. When target speech is severely interfered by a nonstationary noise, a classifier tends to mistake noise patterns for speech patterns. Expansion of a noise through proper perturbation during training helps to expose the classifier to a broader variety of noisy conditions, and hence may lead to better separation performance. This study examines three noise perturbations on supervised speech separation: noise rate, vocal tract length, and frequency perturbation at low signal-to-noise ratios (SNRs). The speech separation performance is evaluated in terms of classification accuracy, hit minus false-alarm rate and short-time objective intelligibility (STOI). The experimental results show that frequency perturbation is the best among the three perturbations in terms of speech separation. In particular, the results show that frequency perturbation is effective in reducing the error of misclassifying a noise pattern as a speech pattern.
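The hit minus false-alarm rate named in this abstract can be illustrated with a small sketch. It assumes the commonly used definition: HIT is the fraction of target-dominant time-frequency units that the estimated mask labels as target, and FA is the fraction of noise-dominant units wrongly labelled as target. The function name `hit_minus_fa` and the toy masks are hypothetical and are not taken from the cited paper.

```python
def hit_minus_fa(estimated_mask, ideal_mask):
    """Return (hit, fa, hit - fa) for two binary masks given as nested lists of 0/1.

    Assumed definition: 1 marks a target-dominant unit, 0 a noise-dominant unit.
    """
    hits = targets = false_alarms = noises = 0
    for est_row, ideal_row in zip(estimated_mask, ideal_mask):
        for est, ideal in zip(est_row, ideal_row):
            if ideal == 1:
                targets += 1
                hits += est == 1          # target unit correctly kept
            else:
                noises += 1
                false_alarms += est == 1  # noise unit mistaken for speech
    hit = hits / targets if targets else 0.0
    fa = false_alarms / noises if noises else 0.0
    return hit, fa, hit - fa


# Toy masks (hypothetical): rows are frequency channels, columns are time frames.
ideal = [[1, 1, 0, 0], [1, 0, 0, 0]]
estimated = [[1, 0, 0, 1], [1, 0, 0, 0]]
hit, fa, score = hit_minus_fa(estimated, ideal)
print(f"HIT={hit:.3f} FA={fa:.3f} HIT-FA={score:.3f}")
```

A lower FA corresponds to fewer noise patterns misclassified as speech, which is the error the abstract says frequency perturbation reduces.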


“…The work was later followed up by [19]-[21] on large vocabulary continuous speech recognition (LVCSR). Similarly, elastic spectral distortion was investigated in [22] where sparse data was augmented by vocal tract length (VTL) distortion, speech rate distortion and frequency-axis random distortion for DNN-HMM training.…”

Section: Introduction (mentioning)

confidence: 99%
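The statement above names VTL distortion and speech rate distortion as ways to augment sparse data. A rough sketch of the general idea follows, assuming a spectrogram stored as a list of frames and a plain linear warp with linear interpolation. This is an illustrative simplification, not the procedure of Kanda et al.; the helper names `interpolate`, `warp_frequency`, and `change_rate`, the warp factors, and the toy data are all hypothetical.

```python
def interpolate(values, position):
    """Linearly interpolate a list of floats at a fractional index, clamped to the ends."""
    position = min(max(position, 0.0), len(values) - 1)
    low = int(position)
    high = min(low + 1, len(values) - 1)
    weight = position - low
    return (1.0 - weight) * values[low] + weight * values[high]


def warp_frequency(frame, alpha):
    """VTL-style distortion (assumed linear warp): output bin k reads input bin k * alpha."""
    return [interpolate(frame, k * alpha) for k in range(len(frame))]


def change_rate(frames, rate):
    """Speech-rate distortion: resample frames along time; rate > 1 shortens the utterance."""
    n_out = max(1, round(len(frames) / rate))
    n_bins = len(frames[0])
    out = []
    for t in range(n_out):
        pos = t * rate
        # Interpolate each frequency bin across neighbouring frames.
        out.append([interpolate([f[k] for f in frames], pos) for k in range(n_bins)])
    return out


# Toy spectrogram (hypothetical): 4 frames of 4 frequency bins each.
spectrogram = [
    [0.0, 1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0, 7.0],
    [8.0, 9.0, 10.0, 11.0],
    [12.0, 13.0, 14.0, 15.0],
]
warped = [warp_frequency(frame, 0.5) for frame in spectrogram]
faster = change_rate(spectrogram, 2.0)
```

Each distorted copy keeps the original transcript, so the copies can be added to the training set as extra examples with the same labels.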

Data Augmentation for Deep Neural Network Acoustic Modeling

Cui

Goel

Kingsbury

2015

IEEE/ACM Trans. Audio Speech Lang. Process.

255


This paper investigates data augmentation for deep neural network acoustic modeling based on label-preserving transformations to deal with data sparsity. Two data augmentation approaches, vocal tract length perturbation (VTLP) and stochastic feature mapping (SFM), are investigated for both deep neural networks (DNNs) and convolutional neural networks (CNNs). The approaches are focused on increasing speaker and speech variations of the limited training data such that the acoustic models trained with the augmented data are more robust to such variations. In addition, a two-stage data augmentation scheme based on a stacked architecture is proposed to combine VTLP and SFM as complementary approaches. Experiments are conducted on Assamese and Haitian Creole, two development languages of the IARPA Babel program, and improved performance on automatic speech recognition (ASR) and keyword search (KWS) is reported.

Index Terms: data augmentation, stochastic feature mapping, deep neural networks, automatic speech recognition, keyword search.

