2021
DOI: 10.48550/arxiv.2104.05379
Preprint
Comparing the Benefit of Synthetic Training Data for Various Automatic Speech Recognition Architectures

Abstract: Recent publications on automatic speech recognition (ASR) have a strong focus on attention encoder-decoder (AED) architectures, which work well for large datasets but tend to overfit in low-resource scenarios. One solution to this issue is to generate synthetic data with a trained text-to-speech (TTS) system if additional text is available. This has been applied successfully in many publications with AED systems. We present a novel approach of silence correction in the data pre-processing for TTS…

Cited by 2 publications (6 citation statements)
References 24 publications
“…SpecAugment is also shown to improve classical hybrid HMM models [297]. TTS, on the other hand, considerably improved attention-based encoder-decoder models trained on limited resources, but did not reach the performance of other E2E approaches or of hybrid HMM models, which in turn were not considerably improved by TTS [179]. Multilingual approaches also help improve ASR development for low-resource tasks, again both for classical [298] and for E2E systems [299], [300].…”
Section: Use of Large-Scale Pretrained LMs
confidence: 93%
“…Most data augmentation methods perturb the data by exploiting certain dimensions of speech-signal variation: speed perturbation [172], [173], vocal tract length perturbation [174], [172], frequency-axis distortion [172], sequence noise injection [175], SpecAugment [176], or semantic mask [177]. Also, text-only data may be used to generate data with text-to-speech (TTS) at the feature [178] or signal level [179]. In a comparison of the effect of TTS-based data augmentation on different E2E ASR architectures in [179], AED seemed to be the only architecture that benefited significantly from TTS-generated data.…”
Section: G Data Augmentation
confidence: 99%
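To make the masking-style augmentation mentioned above concrete, the following is a minimal SpecAugment-style sketch: random frequency and time masks applied to a log-mel spectrogram. All parameter names and default widths here are illustrative assumptions, not values taken from the cited papers (the original SpecAugment also includes time warping, omitted here for brevity).

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=8,
                 num_time_masks=2, time_mask_width=20, rng=None):
    """SpecAugment-style masking on a (freq_bins, time_steps) log-mel
    spectrogram. Masked bands are filled with the spectrogram mean.
    Mask counts/widths are illustrative defaults, not canonical values."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    fill = out.mean()
    n_freq, n_time = out.shape
    # Mask random horizontal bands (frequency channels).
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_mask_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w)))
        out[f0:f0 + w, :] = fill
    # Mask random vertical bands (time steps).
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_mask_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w)))
        out[:, t0:t0 + w] = fill
    return out
```

Because the masks operate directly on features, this augmentation is cheap and model-agnostic, which is consistent with the observation above that it helps hybrid HMM models as well as E2E systems.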