2022
DOI: 10.48550/arxiv.2208.03067
Preprint

Large vocabulary speech recognition for languages of Africa: multilingual modeling and self-supervised learning

Abstract: Almost none of the 2,000+ languages spoken in Africa have widely available automatic speech recognition systems, and the required data is also only available for a few languages. We have experimented with two techniques which may provide pathways to large vocabulary speech recognition for African languages: multilingual modeling and self-supervised learning. We gathered available open source data and collected data for 15 languages, and trained experimental models using these techniques. Our results show that …

Cited by 1 publication (2 citation statements)
References 29 publications (39 reference statements)
“…As the MCV dataset evolved through multiple versions, several studies and experimental results (Ritchie et al., 2022; Ravanelli et al., 2021; Kuchaiev et al., 2019) have been reported on the dataset. The best results are generally obtained by fine-tuning pre-trained models such as wav2vec2.0 (Baevski et al., 2020), which are typically pre-trained on large English-only or multilingual speech data.…”
Section: Related Work
confidence: 99%
“…Recent advances in deep learning techniques for end-to-end speech recognition and the availability of open source frameworks and datasets allow us to empirically explore different ways to improve ASR performance for Kinyarwanda. While recent experimental reports and studies (Ravanelli et al., 2021; Ritchie et al., 2022) have shown improvements in ASR for Kinyarwanda, mostly via self-supervised pre-training (Self-PT) representations such as wav2vec2.0 (Baevski et al., 2020), there has been no exploration of using Kinyarwanda-only speech data for Self-PT pre-training, or of how to improve performance beyond using Self-PT representations. In this work, we report empirical experiments showing how ASR performance for Kinyarwanda can be improved through Self-PT pre-training on Kinyarwanda-only speech data, following a simple curriculum learning schedule during fine-tuning, and using semi-supervised learning (Semi-SL) to leverage large unlabelled data.…”
Section: Introduction
confidence: 99%
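The citation above mentions a "simple curriculum learning schedule during fine-tuning" without specifying it. One common form for ASR orders utterances by duration so early training steps see short, easier examples and later epochs progressively admit longer ones. The sketch below illustrates that idea only; the function name, the dictionary fields, and the fraction-per-epoch schedule are hypothetical, not taken from the cited paper.

```python
# Length-based curriculum schedule (illustrative sketch, not the paper's method).
# Utterances are sorted by duration; epoch e draws batches from the shortest
# fraction (e + 1) / epochs of the sorted data, so the pool grows each epoch.

def curriculum_batches(utterances, epochs, batch_size):
    """Yield (epoch, batch) pairs over a growing pool of short-to-long utterances."""
    ordered = sorted(utterances, key=lambda u: u["duration"])
    for epoch in range(epochs):
        # At least one full batch, widening toward the whole dataset.
        cutoff = max(batch_size, int(len(ordered) * (epoch + 1) / epochs))
        pool = ordered[:cutoff]
        for i in range(0, len(pool), batch_size):
            yield epoch, pool[i:i + batch_size]
```

In a real fine-tuning loop each batch would be fed to the acoustic model's training step; here the scheduler is kept framework-agnostic.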