Deep learning based large vocabulary continuous speech recognition of an under-resourced language Bangladeshi Bangla

Samin, Ahnaf Mozib; Kobir, M. Humayon; Kibria, Shafkat; Rahman, M. Shahidur

doi:10.1250/ast.42.252

Cited by 3 publications

(4 citation statements)

References 15 publications


“…Samin et al evaluated the quality of a large-scale publicly available LB-ASRTD corpus (229 hours) using deep learning-based approaches by conducting character-wise error analysis [20]. They also found a deep CNN-based acoustic model and a 5-gram Markov Language Model (LM) to be capable of achieving a lower word error rate (WER) on LB-ASRTD.…”

Section: Related Work In Bangla

mentioning

confidence: 99%

“…In this study, we also use a deep CNN-based model while utilizing a higher number of MFCCs during the input feature extraction and introducing layer normalization in each convolution layer. Based on an acoustic study on a regional accented speech and the character-wise error analysis on LB-ASRTD, the requirement of a new corpus with more speaker variability and character-wise well-balancedness was recommended [20], [21]. Therefore, Kibria et al developed the 241-hour-long publicly available Bangladeshi Bangla SUBAK.KO corpus with the aim of addressing the abovementioned issues of LB-ASRTD [7].…”

Section: Related Work In Bangla

mentioning

confidence: 99%

“…This is done for fully supervised CNN, weakly supervised Whisper, and self-supervised wav2vec 2.0. While rescoring with the help of a language model achieves better performance [20], we would like to observe the WERs and CERs of the acoustic models without any interference from the language model. The goal of this study is to investigate the robustness of several ASR systems across multiple domains and evaluate SUBAK.KO on both read and spontaneous speech.…”

Section: Subakko Which Contains 241 Hours Of Transcribed

mentioning

confidence: 99%
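
The word error rate (WER) and character error rate (CER) named in the statement above are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. The following is a minimal sketch of that standard definition, not the evaluation code used in either paper. The function names are hypothetical, tokenization is assumed to be whitespace splitting, and characters are assumed to be Unicode code points (a Bangla evaluation might instead count grapheme clusters or normalize the text first).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (substitutions, insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count.
    Assumes whitespace tokenization (an illustrative choice)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length.
    Assumes each Unicode code point counts as one character."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, since the denominator is the reference length only.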

“…FIGURE 1. We use the same deep CNN architecture like [20] as our baseline in this study. The architecture consists of 20 convolutional layers, each followed by ReLU activation and a dropout layer.…”

mentioning

confidence: 99%


BanSpeech: A Multi-Domain Bangla Speech Recognition Benchmark Toward Robust Performance in Challenging Conditions

Samin, Kobir, Rafee et al. 2024

IEEE Access

Self Cite


Despite huge improvements in automatic speech recognition (ASR) employing neural networks, ASR systems still suffer from a lack of robustness and generalizability issues due to domain shifting. This is mainly because principal corpus design criteria are often not identified and examined adequately while compiling ASR datasets. In this study, we investigate the robustness of the state-of-the-art transfer learning approaches, namely self-supervised wav2vec 2.0 and weakly supervised Whisper, and fully supervised convolutional neural networks (CNNs) for multi-domain ASR. We also demonstrate the significance of domain selection while building a corpus by assessing these models on a novel multi-domain Bangladeshi Bangla ASR evaluation benchmark, BanSpeech, which contains approximately 6.52 hours of human-annotated speech and 8085 utterances from 13 distinct domains. SUBAK.KO, a mostly read speech corpus for the morphologically rich language Bangla, has been used to train the ASR systems. Experimental evaluation reveals that self-supervised cross-lingual pre-training with wav2vec 2.0 is the best strategy compared to weak supervision and full supervision to tackle the multi-domain ASR task. Moreover, the ASR models trained on SUBAK.KO face difficulty recognizing speech from domains with mostly spontaneous speech. The BanSpeech will be publicly available to meet the need for a challenging evaluation benchmark for Bangla ASR.
