Towards Data Selection on TTS Data for Children’s Speech Recognition

Wang, Wei; Zhou, Zhikai; Lu, Yizhou; Wang, Hongji; Du, Chenpeng; Ye, Qian

doi:10.1109/icassp39728.2021.9413930

Cited by 8 publications

(3 citation statements)

References 20 publications

(19 reference statements)

“…Most of these approaches consist of various data augmentation techniques for increasing the amount of usable training data. Text-to-Speech based data augmentations as introduced by [14] and [17], where ASR models are finetuned using synthetic data, have not shown significant increases in the accuracy of child ASR. Generative Adversarial Network (GAN) based augmentation [18], [19], [20] has also been explored to increase the amount of labeled data with acoustic attributes like those of child speech.…”

Section: A Related Work (mentioning)

confidence: 99%

A WAV2VEC2-Based Experimental Study on Self-Supervised Learning Methods to Improve Child Speech Recognition

Jain, Barcovschi, Yiwere et al. 2023, IEEE Access

Despite recent advancements in deep learning technologies, Child Speech Recognition remains a challenging task. Current Automatic Speech Recognition (ASR) models require substantial amounts of annotated data for training, which is scarce. In this work, we explore using the ASR model, wav2vec2, with different pretraining and finetuning configurations for self-supervised learning (SSL) toward improving automatic child speech recognition. The pretrained wav2vec2 models were finetuned using different amounts of child speech training data, adult speech data, and a combination of both, to discover the optimum amount of data required to finetune the model for the task of child ASR. Our trained model achieves the best Word Error Rate (WER) of 7.42 on the MyST child speech dataset, 2.91 on the PFSTAR dataset and 12.77 on the CMU KIDS dataset using cleaned variants of each dataset. Our models outperformed the unmodified wav2vec2 BASE 960 on child speech using as little as 10 hours of child speech data in finetuning. The analysis of different types of training data and their effect on inference is provided by using a combination of custom datasets in pretraining, finetuning and inference. These 'cleaned' datasets are provided for use by other researchers to provide comparisons with our results.
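
The abstract above reports results as Word Error Rate (WER). As a minimal sketch of how that metric is conventionally computed (word-level edit distance divided by the number of reference words, as a percentage), the function below may help. The name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not the scoring pipeline used by the cited authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (substitutions + deletions + insertions) / reference words.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, a WER computed this way can exceed 100 when the hypothesis is much longer than the reference.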

“…While data augmentation has been predominantly used to reduce WER in LVSCR, very few researchers adapt data augmentation to handle OOV in ASR [61,62,63]. These studies address issues related to specific words such as proper nouns.…”

Section: OOV Detection and Recovery (mentioning)

confidence: 99%

Low Resource Kannada Speech Recognition using Lattice Rescoring and Speech Synthesis

Murthy, Sitaram 2023, IJST

Objectives: Improving the accuracy of low resource speech recognition in a model trained on only 4 hours of transcribed continuous speech in the Kannada language, using data augmentation. Methods: The baseline language model is augmented with unigram counts of words that are present in the Wikipedia text corpus but absent in the baseline, for initial decoding. Lattice rescoring is then applied using the language model augmented with Wikipedia text. Speech synthesis-based augmentation with multi-speaker syllable-based synthesis, using voices in Kannada and, cross-lingually, in Telugu, is employed. We synthesize basic syllables, syllables with consonant conjuncts, and words that contain syllables that are absent in the training speech, for the Kannada language. Findings: An overall word error rate (WER) of 9.04% is achieved over a baseline WER of 40.93%. Language model augmentation and lattice rescoring give an absolute improvement of 16.68%. Applying our method of syllable-based speech synthesis over language model augmentation and rescoring yields a total absolute reduction of 31.89% in WER. The proposed approach of language model augmentation is memory efficient and consumes only about one-eighth of the memory required for decoding with the Wikipedia-augmented language model (2 gigabytes versus 18 gigabytes) while giving comparable WER (22.95% for Wikipedia versus 24.25% for our method). Augmentation with synthesized syllables enhances the ability of the speech recognition model to recognize basic sounds, thus improving recognition of out-of-vocabulary words to 90% and in-vocabulary words to 97%. Novelty: We propose novel methods of language model augmentation and synthesis-based augmentation to achieve low WER for a speech recognition model trained on only 4 hours of continuous speech. Obtaining high recognition accuracy (or low WER) for a very small speech corpus is a challenge. In this paper, we demonstrate that high accuracy can be achieved using data augmentation for speech recognition based on a small corpus.
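
The WER figures in this abstract are absolute differences in percentage points, and they can be checked against one another. The sketch below redoes that arithmetic using only the numbers quoted in the abstract; the function and variable names are illustrative, and the relative reduction is a derived quantity that the abstract itself does not state.

```python
def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Drop in WER, in percentage points."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Drop in WER as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Figures quoted in the abstract (WER in percent).
baseline_wer = 40.93   # baseline system
final_wer = 9.04       # after LM augmentation, rescoring and synthesis
lm_gain = 16.68        # absolute improvement from LM augmentation + rescoring

# WER after the language-model step alone; matches the 24.25% quoted
# for "our method" in the memory comparison.
wer_after_lm = round(baseline_wer - lm_gain, 2)

# Total absolute reduction; matches the 31.89% quoted in the abstract.
total_absolute = round(absolute_reduction(baseline_wer, final_wer), 2)

# Derived here, not stated in the abstract: the same drop in relative terms.
total_relative = relative_reduction(baseline_wer, final_wer)
```

Read this way, the 31.89% figure is an absolute drop from 40.93% to 9.04%, which corresponds to a relative reduction of roughly 78% of the baseline WER.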

“…In [14,15,16], adult speech signals are modified using a cycle consistent generative adversarial networks (GAN) to synthetically generate speech data with acoustic attributes similar to child speakers, and the synthetically generated speech is combined with a training set. Synthetic speech signals generated from children's TTS model were added to ASR training to improve performance on children test cases in [17] and [18]. The stochastic feature mapping (SFM) technique was also explored to transform out-of-domain adult data to children's speech data in [19].…”

Section: Related Work (mentioning)

confidence: 99%

Spectral Modification Based Data Augmentation For Improving End-to-End ASR For Children's Speech

Singh, Sailor, Bhattacharya et al. 2022, Preprint

Training a robust Automatic Speech Recognition (ASR) system for children's speech recognition is a challenging task due to inherent differences in the acoustic attributes of adult and child speech and the scarcity of publicly available children's speech datasets. In this paper, novel segmental spectrum warping and perturbations in formant energy are introduced to generate a children-like speech spectrum from an adult's speech spectrum. This modified adult spectrum is then used as augmented data to improve end-to-end ASR systems for children's speech recognition. The proposed data augmentation methods give 6.5% and 6.1% relative reductions in WER on the children's dev and test sets respectively, compared to the vocal tract length perturbation (VTLP) baseline system trained on the Librispeech 100-hour adult speech dataset. When children's speech data is added in training with the Librispeech set, it gives 3.7% and 5.1% relative reductions in WER, compared to the VTLP baseline system.
