“…Developed in 2016, Bangla Babel is likewise a telephone-conversation-based speech corpus, comprising 215 hours of speech [18]. However, Babel contains West Bengal-accented speech, which is distinct from the Bangladeshi Bangla accent [7]. Ahmed et al. prepared a 960-hour broadcast Bangla speech corpus by transcribing the speech data automatically with pre-trained ASR models [19].…”
Section: Related Work In Bangla
“…Bangla is a morphologically rich language from the Indo-Aryan language subgroup. Kibria et al. developed SUBAK.KO, an annotated speech corpus for speech recognition research comprising 241 hours of Bangladeshi Bangla speech data, to address the dearth of annotated speech datasets in Bangla [7]. SUBAK.KO contains 229 hours of clean read speech and 12 hours of broadcast speech utterances.…”
Section: Introduction
“…The recording scripts are collected from 40 text domains, including conversations, sports, news, poetry, letters, etc., following the reception and production criteria for the text domains to build the corpus [8]. According to a cross-dataset evaluation, SUBAK.KO is a more balanced corpus with respect to regional accents and other types of speaker variability compared to another large-scale Bangladeshi Bangla speech corpus, LB-ASRTD, when evaluated on clean read speech test sets [7], [9]. The read speech test sets from standard datasets often cannot evaluate the robustness of ASR applications since the same domains are also included in the training set [10].…”
Despite huge improvements in automatic speech recognition (ASR) employing neural networks, ASR systems still suffer from robustness and generalizability issues caused by domain shift. This is mainly because the principal corpus design criteria are often not identified and examined adequately while compiling ASR datasets. In this study, we investigate the robustness of state-of-the-art transfer learning approaches, namely self-supervised wav2vec 2.0 and weakly supervised Whisper, as well as fully supervised convolutional neural networks (CNNs), for multi-domain ASR. We also demonstrate the significance of domain selection while building a corpus by assessing these models on a novel multi-domain Bangladeshi Bangla ASR evaluation benchmark, BanSpeech, which contains approximately 6.52 hours of human-annotated speech and 8085 utterances from 13 distinct domains. SUBAK.KO, a mostly read speech corpus for the morphologically rich language Bangla, has been used to train the ASR systems. Experimental evaluation reveals that self-supervised cross-lingual pre-training with wav2vec 2.0 is the best strategy, compared to weak supervision and full supervision, for tackling the multi-domain ASR task. Moreover, the ASR models trained on SUBAK.KO have difficulty recognizing speech from domains consisting mostly of spontaneous speech. BanSpeech will be made publicly available to meet the need for a challenging evaluation benchmark for Bangla ASR.
“…The output states of the hidden layer are constrained so that its nodes enter a sparse state, with the average activation of the hidden-layer nodes driven toward 0. In this way, the proportion of active nodes remains relatively small, and the features of the hidden-layer nodes do not become homogeneous [16]. The loss function of the sparse autoencoder is shown in…”
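The quoted passage truncates before the equation. As a hedged sketch of the standard formulation it describes (not necessarily the exact equation in [16]), a sparse autoencoder augments the reconstruction error with a KL-divergence penalty that pushes each hidden unit's average activation toward a small sparsity target; the names `rho` and `beta` below are illustrative assumptions, not symbols from the cited paper:

```python
import numpy as np

def kl_sparsity_penalty(rho, rho_hat):
    """KL divergence between the sparsity target rho and a unit's mean activation rho_hat."""
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sparse_autoencoder_loss(x, x_recon, hidden, rho=0.05, beta=3.0):
    # Reconstruction term: mean squared error between input and reconstruction.
    recon = np.mean((x - x_recon) ** 2)
    # Mean activation of each hidden unit over the batch, clipped away from 0 and 1
    # so the logarithms stay finite.
    rho_hat = np.clip(hidden.mean(axis=0), 1e-8, 1 - 1e-8)
    # Sparsity term: sum of per-unit KL penalties, weighted by beta.
    sparsity = np.sum(kl_sparsity_penalty(rho, rho_hat))
    return recon + beta * sparsity
```

With a target such as rho = 0.05, the penalty is zero only when each unit's average activation equals the target, so most hidden units stay near-inactive on average — the "small proportion of active nodes" behavior the passage describes.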
In order to investigate how to recognize English words and speech corpora, an English vocabulary and English speech recognition model based on a deep learning algorithm was proposed. By identifying the key technical problems and their deep-learning-based solutions, the study examined how recognition of English vocabulary and a speech corpus can be realized. In the experiments, the accuracy of the deep-learning-based method for English vocabulary and speech corpus recognition increased by 79% over previous methods. Combining the principles of the deep autoencoder with deep learning algorithms, the research emphasis was on the effect of the speech recognition framework on the speech corpus. Speech recognition research based on deep learning theory not only has theoretical guidance value but also practical value in real applications.
“…Almost 200 million people worldwide speak Bangla as their first language, making it the 4th most widely spoken first language in the world [11]. Bangla natural language processing (BNLP) [12] resources are scarce compared to those for English, but the field is growing rapidly in areas such as speech recognition [13]-[16] and Bangla offensive word recognition [17]. Far less work exists on Bangla music, though statistics regarding the classification of musical genres have been provided in the literature. We are now aware of the improvements made thus far both in this sector and in Bangla music.…”
<p>Music has control over human moods: it can make someone calm or excited, and it allows us to feel the full range of emotions we experience. Nowadays, people are often attached to their phones and computers, listening to music on Spotify, Soundcloud, or other internet platforms. Music information retrieval plays an important role in music recommendation according to lyrics, pitch, patterns of choice, and genre. In this study, we have tried to recognize the music genre for a better music recommendation system. We collected 1820 Bangla songs from six different genres: Adhunik, Rock, Hip hop, Nazrul, Rabindra, and Folk. We started with traditional machine learning algorithms, including K-Nearest Neighbor, Logistic Regression, Random Forest, Support Vector Machine, and Decision Tree, but ended up with a deep learning algorithm, an Artificial Neural Network, achieving an accuracy of 78% in recognizing music genres across the six genres. All of these algorithms were experimented with on transformed mel-spectrograms and mean chroma frequency values of the raw amplitude data. However, we found that adding the tempo (beats-per-minute value) to the two previous features yields better accuracy.</p>
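As an illustration of the simplest classifier this abstract mentions, the K-Nearest Neighbor step over per-song feature vectors can be sketched as follows. The feature values and genre labels below are invented toy stand-ins for (mean mel energy, mean chroma value, tempo in BPM), not values from the paper's dataset:

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Classify a feature vector by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(train_X - query, axis=1)   # Euclidean distance to each song
    nearest = np.argsort(dists)[:k]                   # indices of the k closest songs
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # most common label wins

# Hypothetical toy features: (mean mel energy, mean chroma, tempo in BPM).
train_X = np.array([[0.2, 0.1,  90.0],
                    [0.3, 0.2,  95.0],
                    [0.8, 0.9, 160.0],
                    [0.9, 0.8, 170.0]])
train_y = np.array(["Folk", "Folk", "Rock", "Rock"])

genre = knn_predict(train_X, train_y, np.array([0.25, 0.15, 92.0]))  # → "Folk"
```

In practice the features would be scaled to comparable ranges first (here the raw BPM value dominates the distance), which is one reason the abstract's later neural-network model can outperform distance-based baselines.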