Researchers in many countries have developed automatic speech recognition (ASR) systems to advance information and communication technology for their own languages. This work aims to improve ASR performance for the Myanmar language by varying Convolutional Neural Network (CNN) hyperparameters such as the number of feature maps and the pooling size. Owing to locality and the pooling operation, CNNs can reduce spectral variations and model the spectral correlations that exist in the signal. The impact of these hyperparameters on CNN accuracy in ASR tasks is therefore investigated. A 42-hour data set is used as training data, and ASR performance is evaluated on two open test sets: web news and recorded data. Because Myanmar is a syllable-timed language, a syllable-based ASR system was built and compared with a word-based one. The system achieved a 16.7% word error rate (WER) and an 11.5% syllable error rate (SER) on TestSet1, and 21.83% WER and 15.76% SER on TestSet2.
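To make the role of the pooling size concrete, the following is a minimal sketch of non-overlapping max pooling along the frequency axis of a single feature map. It is not the configuration used in the work above: the function name `max_pool_freq`, the `(frequency bands, frames)` array layout, and the toy values are all assumptions chosen for illustration.

```python
import numpy as np


def max_pool_freq(feature_map, pool_size):
    """Non-overlapping max pooling along the frequency axis.

    feature_map: 2-D array shaped (n_bands, n_frames); this layout is an
    assumption made for the sketch.
    pool_size: number of adjacent frequency bands merged into one output band.
    """
    n_bands, n_frames = feature_map.shape
    # Drop trailing bands that do not fill a complete pooling window.
    usable = (n_bands // pool_size) * pool_size
    trimmed = feature_map[:usable]
    # Group bands into windows of pool_size and keep the maximum of each window.
    windows = trimmed.reshape(usable // pool_size, pool_size, n_frames)
    return windows.max(axis=1)


# Toy illustration: a spectral peak that moves by one band inside the same
# pooling window produces an identical pooled output.
peak_low = np.zeros((6, 1))
peak_low[0, 0] = 1.0
peak_shifted = np.zeros((6, 1))
peak_shifted[1, 0] = 1.0
same_output = np.array_equal(
    max_pool_freq(peak_low, 3), max_pool_freq(peak_shifted, 3)
)
```

A larger `pool_size` tolerates larger frequency shifts but discards more spectral detail, which is one reason the pooling size is a hyperparameter worth tuning.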
This paper introduces a speech corpus developed for Myanmar Automatic Speech Recognition (ASR) research. ASR research has been conducted by researchers around the world to improve language technologies for their own languages. Speech corpora are essential for developing ASR systems, and creating them is especially necessary for low-resourced languages. Myanmar can be regarded as a low-resourced language because it lacks pre-built resources for speech processing research. In this work, a speech corpus named UCSY-SC1 (University of Computer Studies Yangon - Speech Corpus1) is created for Myanmar ASR research. The corpus covers two domains: news and daily conversations. Its total size is over 42 hours, comprising 25 hours of web news and 17 hours of recorded conversational data. The corpus was collected from 177 females and 84 males for the news data, and from 42 females and 4 males for the conversational domain. This corpus was used as training data for developing Myanmar ASR. Three types of acoustic model were built and their results compared: Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN). Experiments were conducted on different data sizes, and evaluation was done on two test sets: TestSet1 (web news) and TestSet2 (recorded conversational data). The Myanmar ASR systems trained on this corpus gave satisfactory results on both test sets, achieving word error rates of 15.61% on TestSet1 and 24.43% on TestSet2.
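The error rates reported above follow the usual definition: the minimum number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. The sketch below computes that ratio over token sequences; word tokens give a WER and syllable tokens give an SER. The function names and placeholder tokens are hypothetical, and the sketch assumes the text has already been segmented into words or syllables, which for Myanmar is a separate step not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the reference prefix processed so
    # far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference token
                curr[j - 1] + 1,                  # insertion of a hypothesis token
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance divided by the reference length (WER or SER)."""
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


# Placeholder tokens: one substitution and one deletion over four reference tokens.
reference = ["w1", "w2", "w3", "w4"]
hypothesis = ["w1", "wX", "w3"]
rate = error_rate(reference, hypothesis)  # 2 errors / 4 reference tokens = 0.5
```

Because the denominator is the reference length, the same utterance usually yields different WER and SER values: a single wrong word may contain several syllables, only some of which are wrong.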