Interspeech 2020
DOI: 10.21437/interspeech.2020-2199

Data Augmentation Using Prosody and False Starts to Recognize Non-Native Children’s Speech

Abstract: This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition (ASR) for non-native children's speech. The task is to recognize non-native speech from children of various age groups given a limited amount of speech. Moreover, the speech being spontaneous has false starts transcribed as partial words, which in the test transcriptions leads to unseen partial words. To cope with these two challenges, we investigate a data augmentation-based approach.…

Cited by 10 publications
(8 citation statements)
References 14 publications
“…To resolve this issue we investigated two types of time-scale modification techniques. The real-time iterative spectrogram inversion with look-ahead (RTISI-LA) algorithm [23,24,25] constructs a high-quality time-domain signal from its short-time magnitude spectrum with varying parameter s from scale 0.65 to 1.85 with step size 0.10. We also investigated synchronized overlap-add fixed synthesis (SOLAFS) [26] based time scale modification.…”
Section: Time-scale Modification (mentioning)
confidence: 99%
“…We then augment the modified data to the original corpora for further system development. To modify the pitch and speaking rate, we have explored Time Scale Modification (TSM) based on Real-Time Iterative Spectrogram Inversion with Look-Ahead (RTISI-LA) algorithm [16,31]. Both these prosody parameters are tunable and we varied the pitch modification factor r_s from 0.65 to 1.45 to modify pitch and the speaking rate modification factor α from 0.65 to 1.85 with a step size of 0.10.…”
Section: Prosody Based Data Augmentation (mentioning)
confidence: 99%
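The factor ranges quoted in the statement above can be enumerated with a short sketch. This only lists the modification factors; it does not implement RTISI-LA or any time-scale modification. The helper name `factor_grid` and the variable names are hypothetical, and applying the 0.10 step to the pitch range is an assumption, since the statement is not explicit on that point.

```python
def factor_grid(start, stop, step=0.10):
    """Return factors from start to stop (inclusive), rounded to 2 decimals.

    Hypothetical helper: it builds the grid by index to avoid
    accumulating floating-point error from repeated addition.
    """
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n)]


# Speaking-rate factors: 0.65 to 1.85 in steps of 0.10 (as quoted above).
rate_factors = factor_grid(0.65, 1.85)

# Pitch factors: 0.65 to 1.45; the 0.10 step here is an assumption.
pitch_factors = factor_grid(0.65, 1.45)

print(len(rate_factors), rate_factors)
print(len(pitch_factors), pitch_factors)
```

Under these assumptions, each grid value would correspond to one modified copy of the training data that is pooled with the original corpus.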
“…So far, few general purpose ASR has focused upon data augmentation using the training speakers only. This can be performed, for example, by prosody modification of the training data [15,16]. This augmentation approach is not much beneficial for ASR system.…”
Section: Introduction (mentioning)
confidence: 99%
“…Such augmentations are only successful either through real time resource collection or by production of data which is similar to natural speech [8]. Other common approaches for data augmentation are prosody modification and speed/volume perturbation [2,9,10,11]. Apart from these, in [12] spectral augmentation was used directly on Mel frequency cepstral coefficient (MFCC) features and filter bank features to improve ASR performance.…”
Section: Introduction (mentioning)
confidence: 99%