Effect of Prosody Modification on Children's ASR

Shahnawazuddin, Syed; Adiga, Nagaraj; Kathania, Hemant Kumar

doi:10.1109/lsp.2017.2756347

Cited by 33 publications

(14 citation statements)

References 18 publications

“…The second factor that decreases the recognition rate is the speaking rate of the adult and child speakers. The phoneme duration of the children's speakers is longer as compared to the adults [4]. Thus, the speaking duration of the children's speakers is slower than the adult speakers [7,8].…”

Section: Introduction (mentioning)

confidence: 92%

“…From the literature, it was also found that the pitch of the children is quite different and higher than the adult's speech. This is one of the factors that make children's speech different from adult speech and causes acoustic mismatch [4,5]. The range of the pitch frequency mainly lies between 70 Hz to 255 Hz for the adult speakers whereas for children's pitch frequency ranges usually from 200 Hz to 350 Hz [4][5][6].…”

Section: Introduction (mentioning)

confidence: 99%

Usage of Prosody Modification and Acoustic Adaptation for Robust Automatic Speech Recognition (ASR) System

Bhardwaj, Kukreja, Singh

2021

RIA

Most automatic speech recognition (ASR) systems are trained on adult speech because children's speech datasets are scarce. The recognition rate of such systems is very low when they are tested on children's speech, owing to the inter-speaker acoustic variability between adults' and children's speech. This variability arises mainly from children's higher pitch and lower speaking rate. Thus, the main objective of this research is to increase the recognition rate of the Punjabi-ASR system by reducing this variability through prosody modification and speaker adaptive training. Prosody modification can alter the pitch period and duration (speaking rate) of a speech signal without affecting its naturalness or message, which helps to overcome the acoustic differences between adults' and children's speech. The developed Punjabi-ASR system is trained on adult speech and prosody-modified adult speech. The prosody-modified speech overcomes the need for large amounts of children's speech when training the ASR system and improves the recognition rate. Results show that prosody modification and speaker adaptive training reduce the word error rate (WER) of the Punjabi-ASR system to 8.79% when tested on children's speech.

“…In the context of children speech, prosodic features and modifications are well studied [2,11,13,15,16]. Prior work [16] has leveraged similar prosody modifications for data augmentation in children ASR achieving substantial gains in performance.…”

Section: Related Work (mentioning)

confidence: 99%

“…To alleviate data scarcity, we augment training audio data. Specifically, we compare SpecAugment- [1] and prosody-based [2] data augmentation (section 4). SpecAugment, recently popularized for building a robust ASR, has not been explored for processing children's speech.…”

Section: Introduction (mentioning)

confidence: 99%

Data Augmentation Using Prosody and False Starts to Recognize Non-Native Children’s Speech

Kathania, Singh, Grósz, et al.

2020

Interspeech 2020

This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition (ASR) for non-native children's speech. The task is to recognize non-native speech from children of various age groups given a limited amount of speech. Moreover, because the speech is spontaneous, it contains false starts transcribed as partial words, which leads to unseen partial words in the test transcriptions. To cope with these two challenges, we investigate a data augmentation-based approach. Firstly, we apply prosody-based data augmentation to supplement the audio data. Secondly, we simulate false starts by introducing partial-word noise in the language modeling corpora, creating new words. Acoustic models trained on prosody-based augmented data outperform the models using the baseline recipe or the SpecAugment-based augmentation. The partial-word noise also helps to improve the baseline language model. Our ASR system, a combination of these schemes, placed third in the evaluation period and achieves a word error rate of 18.71%. In the post-evaluation period, we observe that increasing the amount of prosody-based augmented data leads to better performance. Furthermore, removing low-confidence-score words from hypotheses can lead to further gains. These two improvements lower the ASR error rate to 17.99%.
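The abstract above reports its results as word error rate (WER). As a rough illustration of how that metric is usually defined (word-level edit distance divided by the number of reference words), here is a minimal sketch. The function names `word_error_rate` and `relative_reduction` are hypothetical helpers introduced for this example, and whitespace tokenization is an assumption; this is not the scoring tool used by the authors or by the shared task.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative change between two error rates, as a fraction of `before`."""
    return (before - after) / before


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(relative_reduction(18.71, 17.99))
```

Applied to the two figures quoted in the abstract, going from 18.71% to 17.99% is a relative WER reduction of roughly 3.8%.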

“…As discussed earlier, the speech data of child speakers differ from the adults due to pitch and speaking rate. In the case of child speaker, formant scaling also occurs due to smaller vocal tract geometry [42,43]. Child speakers have higher formant frequencies than adults.…”

Section: Effect of Data-Augmented Training on VMD-MFCC Features (mentioning)

confidence: 99%

Adaptive spectral smoothening for development of robust keyword spotting system

Pattanayak, Rout, Pradhan

2019

IET Signal Process.

It is well known that the performance of a keyword spotting (KWS) system degrades significantly under mismatched training and test conditions. In this work, an approach is proposed for reducing the mismatch between training and test speech caused by speaker-related variability and environmental noise. In the proposed approach, variational-mode decomposition is first performed on the short-term magnitude spectrum to decompose it adaptively into a number of variational mode functions (VMFs). Sufficiently smoothed spectra are then reconstructed by selecting only the two lower-frequency VMFs. When the KWS system is developed using Mel frequency cepstral coefficients (MFCCs) extracted from the smoothed spectra, significantly improved performance is observed for pitch- and noise-mismatched test conditions. To further suppress the mismatch due to the pitch and speaking rate of the speakers, data-augmented training based on explicit prosody modification is performed. The experimental results presented in this study show that data-augmented training further enhances the performance of the developed KWS system.
