2023
DOI: 10.1109/access.2023.3275106
A WAV2VEC2-Based Experimental Study on Self-Supervised Learning Methods to Improve Child Speech Recognition

Rishabh Jain,
Andrei Barcovschi,
Mariam Yahayah Yiwere
et al.

Abstract: Despite recent advancements in deep learning technologies, Child Speech Recognition remains a challenging task. Current Automatic Speech Recognition (ASR) models require substantial amounts of annotated data for training, which is scarce. In this work, we explore using the ASR model, wav2vec2, with different pretraining and finetuning configurations for self-supervised learning (SSL) toward improving automatic child speech recognition. The pretrained wav2vec2 models were finetuned using different amounts of ch…

Cited by 11 publications (4 citation statements) · References 45 publications
“…Central to its success is the concept of self-supervised learning, a paradigm shift in ASR research. By training on vast amounts of unlabeled speech data [21], Wav2Vec2 transcends the limitations of traditional supervised methods. This innovation not only significantly boosts data efficiency but also empowers the model to grasp the intricacies of diverse accents and languages, making it a versatile tool for multilingual ASR applications.…”
Section: Presentation Of Wav2vec2 Model
confidence: 99%
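The self-supervised pretraining described in this passage hinges on masking contiguous spans of latent speech frames and learning to identify the true quantized latents for the masked positions. The span-masking step alone can be sketched in plain Python; the masking probability and span length below are illustrative defaults, not values taken from the paper:

```python
import random

def sample_mask(num_frames, mask_prob=0.065, span=10, seed=0):
    """Wav2Vec2-style span masking (sketch): each latent frame is
    chosen as a span start with probability `mask_prob`, and the
    next `span` frames (clipped at the end) are masked."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < mask_prob:
            for i in range(start, min(start + span, num_frames)):
                mask[i] = True
    return mask

mask = sample_mask(200)
print(sum(mask), "of", len(mask), "frames masked")
```

During pretraining, the model is trained with a contrastive objective to distinguish the true latent at each masked position from distractors; the sketch above covers only the mask sampling.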
“…Well-known self-supervised speech models such as Wav2vec series [19,20], Hubert [21], WavLM [22], and MMS (Massively Multilingual Speech) [23] require fine-tuning on domain-specific data to adapt to downstream tasks. Jain et al explored different pretraining and fine-tuning methods for the Wav2vec 2.0 model on the ASR task for child speech [24]. Zhang et al analyzed various combinations of pretraining and fine-tuning on 15 low-resource languages in the OpenASR21 challenge [25].…”
Section: Fine-tuning
confidence: 99%
“…• Original_220h: Contains 220 hours of original adult speech.
• MyST_55h: Contains 55 hours of cleaned MyST child speech, which was prepared according to [56].
The Original_12h and Original_220h sets are the original Librispeech (adult speech) counterparts of the Augmented_17h and Augmented_311h sets, respectively. Note that there is an increase in the number of hours of speech data when augmenting from Original_12h to Augmented_17h and from Original_220h to Augmented_311h, in both cases by a factor of 1.41.…”
Section: A. ASR Finetuning Datasets
confidence: 99%
“…We used four test datasets to test our finetuned models at the inference stage. These datasets were prepared in accordance with our previous research on child speech ASR [56]. Since MyST [30] is the largest child audio corpus available publicly for research use, it was used for both finetuning and inference.…”
Section: A. ASR Finetuning Datasets
confidence: 99%
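The inference-stage comparisons described above are conventionally scored by word error rate (WER), the word-level edit distance between reference and hypothesis transcripts divided by the reference length. A minimal self-contained implementation of that metric (a sketch, not the authors' evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance
    (substitutions + insertions + deletions) divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over six reference words
```

Published ASR results typically use an established scorer (e.g. sclite or the jiwer package) rather than a hand-rolled one, but the computation reduces to this edit distance.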