2023
DOI: 10.1109/access.2023.3275106
A WAV2VEC2-Based Experimental Study on Self-Supervised Learning Methods to Improve Child Speech Recognition

Rishabh Jain,
Andrei Barcovschi,
Mariam Yahayah Yiwere
et al.

Abstract: Despite recent advancements in deep learning technologies, Child Speech Recognition remains a challenging task. Current Automatic Speech Recognition (ASR) models require substantial amounts of annotated data for training, which is scarce. In this work, we explore using the ASR model, wav2vec2, with different pretraining and finetuning configurations for self-supervised learning (SSL) toward improving automatic child speech recognition. The pretrained wav2vec2 models were finetuned using different amounts of ch…

Cited by 11 publications (4 citation statements) · References 45 publications
“…Central to its success is the concept of self-supervised learning, a paradigm shift in ASR research. By training on vast amounts of unlabeled speech data [21], Wav2Vec2 transcends the limitations of traditional supervised methods. This innovation not only significantly boosts data efficiency but also empowers the model to grasp the intricacies of diverse accents and languages, making it a versatile tool for multilingual ASR applications.…”
Section: Presentation Of Wav2vec2 Model
confidence: 99%
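The self-supervised pretraining described in this passage hinges on masking contiguous spans of latent speech frames and learning to identify the true quantized latents for the masked positions. The span-masking step alone can be sketched in plain Python; the masking probability and span length below are illustrative defaults, not values taken from the paper:

```python
import random

def sample_mask(num_frames, mask_prob=0.065, span=10, seed=0):
    """Wav2Vec2-style span masking (sketch): each latent frame is
    chosen as a span start with probability `mask_prob`, and the
    next `span` frames (clipped at the end) are masked."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < mask_prob:
            for i in range(start, min(start + span, num_frames)):
                mask[i] = True
    return mask

mask = sample_mask(200)
print(sum(mask), "of", len(mask), "frames masked")
```

During pretraining, the model is trained with a contrastive objective to distinguish the true latent at each masked position from distractors; the sketch above covers only the mask sampling.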
“…Well-known self-supervised speech models such as Wav2vec series [19,20], Hubert [21], WavLM [22], and MMS (Massively Multilingual Speech) [23] require fine-tuning on domain-specific data to adapt to downstream tasks. Jain et al explored different pretraining and fine-tuning methods for the Wav2vec 2.0 model on the ASR task for child speech [24]. Zhang et al analyzed various combinations of pretraining and fine-tuning on 15 low-resource languages in the OpenASR21 challenge [25].…”
Section: Fine-tuning
confidence: 99%
“…• Original_220h: Contains 220 hours of original adult speech.
• MyST_55h: Contains 55 hours of cleaned MyST child speech, which was prepared according to [56].
The Original_12h and Original_220h sets are the original Librispeech (adult speech) counterparts of the Augmented_17h and Augmented_311h sets, respectively. Note that there is an increase in the number of hours of speech data when augmenting from Original_12h to Augmented_17h and from Original_220h to Augmented_311h, in both cases by a factor of 1.41.…”
Section: A. ASR Finetuning Datasets
confidence: 99%
“…We used four test datasets to test our finetuned models at the inference stage. These datasets were prepared in accordance with our previous research on child speech ASR [56]. Since MyST [30] is the largest child audio corpus available publicly for research use, it was used for both finetuning and inference.…”
Section: A. ASR Finetuning Datasets
confidence: 99%
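The inference-stage comparisons described above are conventionally scored by word error rate (WER), the word-level edit distance between reference and hypothesis transcripts divided by the reference length. A minimal self-contained implementation of that metric (a sketch, not the authors' evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance
    (substitutions + insertions + deletions) divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over six reference words
```

Published ASR results typically use an established scorer (e.g. sclite or the jiwer package) rather than a hand-rolled one, but the computation reduces to this edit distance.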