SUBESCO is an audio-only emotional speech corpus for the Bangla language. The corpus contains 7000 utterances with a total duration of more than 7 hours, making it the largest emotional speech corpus available for this language. Twenty native speakers participated in the gender-balanced set, each recording 10 sentences simulating seven target emotions. Fifty university students participated in the evaluation of the corpus. Each audio clip, except those of the Disgust emotion, was validated four times by male and female raters. Raw hit rates and unbiased hit rates were calculated, producing scores above the chance level of responses. The overall recognition rate was reported to be above 70% for human perception tests. Kappa statistics and intra-class correlation coefficient scores indicated a high level of inter-rater reliability and consistency in the corpus evaluation. SUBESCO is an Open Access database, licensed under Creative Commons Attribution 4.0 International, and can be downloaded free of charge from https://doi.org/10.5281/zenodo.4526477.
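The abstract above does not spell out how the hit rates were computed. As a minimal sketch, assuming the commonly used definitions (raw hit rate as the share of clips of an emotion that were labeled correctly, and Wagner's unbiased hit rate, which also penalizes overuse of a response category), the calculation could look as follows. The function name and the toy confusion matrix are hypothetical and are not taken from the corpus.

```python
def hit_rates(confusion):
    """Return (raw, unbiased) hit rates per emotion.

    confusion[i][j] = number of clips with intended emotion i
    that raters labeled as emotion j (square matrix).
    """
    n = len(confusion)
    row_totals = [sum(row) for row in confusion]  # clips presented per emotion
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]  # uses of each label
    raw, unbiased = [], []
    for i in range(n):
        hits = confusion[i][i]
        raw.append(hits / row_totals[i] if row_totals[i] else 0.0)
        denom = row_totals[i] * col_totals[i]
        # Assumed definition: hits squared over (stimuli presented * label uses).
        unbiased.append(hits * hits / denom if denom else 0.0)
    return raw, unbiased


# Toy 3-emotion matrix (made-up numbers, for illustration only).
toy = [
    [8, 2, 0],
    [1, 6, 3],
    [1, 2, 7],
]
raw, unbiased = hit_rates(toy)
```

Under this definition the unbiased rate never exceeds the raw rate, which is why it is a stricter check against response bias.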
In this study, we present a deep learning-based implementation for speech emotion recognition (SER). The system combines a deep convolutional neural network (DCNN) and a bidirectional long short-term memory (BLSTM) network with a time-distributed flatten (TDF) layer. The proposed model has been applied to the recently built audio-only Bangla emotional speech corpus SUBESCO. A series of experiments was carried out to analyze all the models discussed in this paper in baseline, cross-lingual, and multilingual training-testing setups. The experimental results reveal that the model with a TDF layer achieves better performance than other state-of-the-art CNN-based SER models that can work on both temporal and sequential representations of emotions. For the cross-lingual experiments, cross-corpus training, multi-corpus training, and transfer learning were employed for the Bangla and English languages using the SUBESCO and RAVDESS datasets. The proposed model has attained a state-of-the-art perceptual efficiency, achieving weighted accuracies (WAs) of 86.9% and 82.7% for the SUBESCO and RAVDESS datasets, respectively.
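The abstract above reports weighted accuracy (WA) without defining it. A minimal sketch, assuming the convention often used in SER work (WA is the overall fraction of correctly classified utterances, while unweighted accuracy, UA, is the mean of the per-class recalls), is given below. The function name and the toy labels are hypothetical.

```python
def weighted_and_unweighted_accuracy(y_true, y_pred):
    """Return (WA, UA) under the assumed SER convention."""
    # WA: every utterance counts equally, so large classes dominate.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    wa = correct / len(y_true)

    # UA: every class counts equally (mean of per-class recall).
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    ua = sum(recalls) / len(recalls)
    return wa, ua


# Toy example: an imbalanced set where the minority class is always missed.
y_true = ["angry"] * 4 + ["happy"]
y_pred = ["angry"] * 5
wa, ua = weighted_and_unweighted_accuracy(y_true, y_pred)
```

On a balanced corpus the two measures coincide, so the distinction matters mainly when class sizes differ.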
This article proposes two dynamic Huffman-based code generation algorithms for data compression, namely the Octanary and Hexanary algorithms. Fast encoding and decoding are very important in data compression. We propose tribit-based (Octanary) and quadbit-based (Hexanary) algorithms and compare their performance with the widely used single-bit (Binary) algorithm and the recently introduced dibit (Quaternary) algorithm. The decoding algorithms for the proposed techniques are also described. After assessing all the results, we find that the Octanary and Hexanary techniques outperform the existing techniques in terms of encoding and decoding speed.
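To illustrate the general idea of tribit and quadbit codes, the sketch below builds a static n-ary Huffman code, where each tree node has up to 8 (or 16) children and each code digit occupies 3 (or 4) bits. This is a simplification and not the article's method: the article describes dynamic algorithms, whose details are not given in the abstract. All function names here are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def nary_huffman_codes(text, arity=8):
    """Map each symbol to a tuple of digits in range(arity)."""
    tie = count()  # unique tie-breaker so the heap never compares nodes
    heap = [(w, next(tie), ("leaf", s)) for s, w in Counter(text).items()]
    # Pad with zero-weight dummies so every merge takes exactly `arity` nodes.
    while len(heap) < 2 or (len(heap) - 1) % (arity - 1) != 0:
        heap.append((0, next(tie), ("dummy", None)))
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(arity)]
        weight = sum(c[0] for c in children)
        node = ("node", [c[2] for c in children])
        heapq.heappush(heap, (weight, next(tie), node))

    codes = {}

    def walk(node, prefix):
        kind, payload = node
        if kind == "leaf":
            codes[payload] = prefix
        elif kind == "node":
            for digit, child in enumerate(payload):
                walk(child, prefix + (digit,))

    walk(heap[0][2], ())
    return codes


def encode(text, codes, arity=8):
    """Emit a bit string; each digit uses log2(arity) bits."""
    k = (arity - 1).bit_length()  # 3 bits for arity 8, 4 bits for arity 16
    return "".join(format(d, f"0{k}b") for ch in text for d in codes[ch])


def decode(bits, codes, arity=8):
    """Read k bits at a time and emit a symbol whenever a full code is seen."""
    k = (arity - 1).bit_length()
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ()
    for i in range(0, len(bits), k):
        current += (int(bits[i:i + k], 2),)
        if current in inverse:  # codes are prefix-free, so the first match is final
            out.append(inverse[current])
            current = ()
    return "".join(out)


sample = "abracadabra alakazam"
octanary = nary_huffman_codes(sample, arity=8)
roundtrip = decode(encode(sample, octanary, arity=8), octanary, arity=8)
```

A higher arity gives a shallower tree, so the decoder takes fewer steps per symbol, at the possible cost of longer codes in bits. This is one plausible reason for the speed gains the abstract reports, though the abstract itself does not state the cause.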
Accented pronunciation variability is one of the key factors that degrade the accuracy of automatic speech recognition (ASR). This article reports the results of an acoustic analysis of the variability caused by regional accent in two groups of speakers of Bangladeshi Bangla. The analysis considers the seven monophthongal and four diphthongal vowels of Bangla to investigate the acoustic characteristics of two groups of single-accent speakers and their correlation with the articulation of Standard Colloquial Bangladeshi Bangla (SCBB). An accent is the speaker's regional signature and is shaped by his or her community and educational background. This study examines both male and female speakers from the Sylhet region, which has one of the most deviant dialects of Bangla, and speakers with comparatively less deviant accents from different districts in the northwestern and central parts of Bangladesh. Accent-related acoustic features such as pitch slope, formant frequencies, and vowel duration have been considered to examine the prominent characteristics of the accents and to classify the accents from these features. The two gender groups are analyzed separately. Significant deviations in formant frequencies and varying steepness of the rise and fall in pitch slope were found between accents in both gender groups. The study also observed that accent-related changes in speech affect ASR performance. This emphasizes the need for accent-specific acoustic models to handle speakers of highly deviant dialects, and for considering the variability of accent-affected speakers in corpus development for a robust ASR system in Bangladeshi Bangla.