Most research on audio-text synchronization has focused on synchronization at the utterance level. However, generating audio books from live speech in an unstructured language such as Thai requires a finer level of synchronization. We propose an algorithm that synchronizes live speech with its corresponding transcription in real time at the syllable level. The proposed algorithm combines a syllable endpoint detection method and a syllable landmark detection method, using band-limited intensity as features. Experiments were conducted on the LOTUS datasets, and the results were compared with a baseline ASR-based syllable detection method. We evaluated the algorithm by its error aberration, defined as the difference between the actual number of syllables and the number of detected syllables in each phrase. The average total error aberration was 11.54 for the proposed method and 34.21 for the baseline. The reference deviation of our method was also better than that of the baseline.
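The error aberration metric described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say whether the difference is signed or absolute, so the absolute difference is an assumption here, and the function names `error_aberration` and `total_error_aberration` are hypothetical.

```python
def error_aberration(actual_counts, detected_counts):
    """Per-phrase difference between actual and detected syllable counts.

    Assumption: the absolute difference is used, so over- and
    under-detection do not cancel each other out.
    """
    if len(actual_counts) != len(detected_counts):
        raise ValueError("both lists must describe the same phrases")
    return [abs(a - d) for a, d in zip(actual_counts, detected_counts)]


def total_error_aberration(actual_counts, detected_counts):
    """Sum of the per-phrase error aberrations over a set of phrases."""
    return sum(error_aberration(actual_counts, detected_counts))
```

For example, three phrases with 5, 3, and 4 actual syllables and 4, 3, and 6 detected syllables give per-phrase aberrations of 1, 0, and 2.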
In Thai, tonal information is a crucial component for identifying the lexical meaning of a word. Consequently, Thai tone classification can improve the performance of Thai speech recognition systems. In this article, we report our study of Thai tone classification. Based on our investigation, most Thai tone classification studies have relied on statistical machine learning approaches, especially the Artificial Neural Network (ANN)-based approach and the Hidden Markov Model (HMM)-based approach. Although both approaches give reasonable performance, they have some limitations due to their mathematical models. We therefore introduce a novel approach to Thai tone classification using a Hidden Conditional Random Field (HCRF)-based approach. We also investigated tone configurations involving tone features, frequency scaling, and normalization techniques in order to fine-tune the performance of Thai tone classification. Experiments were conducted in both an isolated word scenario and a continuous speech scenario. Results showed that the HCRF-based approach with the feature F_dF_aF, ERB-rate scaling, and z-score normalization yielded the highest performance and outperformed a baseline using the ANN-based approach, which had been reported as the best for Thai tone classification, in both scenarios. Compared with the best baseline result, the best HCRF-based configuration provided error rate reductions of 10.58% and 12.02% for the isolated word scenario and the continuous speech scenario, respectively.
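The frequency scaling and normalization steps mentioned above can be illustrated with a short sketch. It assumes the commonly used Glasberg and Moore formula for the ERB-rate scale and a population z-score computed over a single contour. The abstract does not state which ERB-rate variant or normalization scope the authors used, and the function names are hypothetical.

```python
import math


def hz_to_erb_rate(f_hz):
    """Convert a frequency in Hz to the ERB-rate scale.

    Assumption: Glasberg and Moore's formula, 21.4 * log10(0.00437 * f + 1).
    """
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)


def z_score(values):
    """Normalize values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0.0:
        # A flat contour carries no relative pitch movement.
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def normalize_f0_contour(f0_hz):
    """Scale an F0 contour (in Hz) to ERB-rate, then z-score normalize it."""
    return z_score([hz_to_erb_rate(f) for f in f0_hz])
```

Scaling before normalizing means the z-score is taken on a perceptually motivated axis, which is one plausible ordering; the reverse order would give different values.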
The Montreal cognitive assessment (MoCA), a widely accepted screening tool for identifying patients with mild cognitive impairment (MCI), includes a language fluency test of verbal functioning; its scores are based on the number of unique correct words produced by the test taker. However, unique words may be counted differently in different languages. This study focuses on Thai as a language that differs from English in terms of word combinations. We applied various automatic speech recognition (ASR) techniques to develop an assisted scoring system for the MoCA language fluency test with Thai language support. This was a challenge because Thai is a low-resource language for which domain-specific data are not publicly available, especially speech data from patients with MCI. Furthermore, the great variety of pronunciation, intonation, tone, and accent among the patients, all of which might differ from healthy controls, adds complexity to the model. We propose a hybrid time delay neural network hidden Markov model (TDNN-HMM) architecture for acoustic model training to create an ASR system that is robust to environmental noise and to the variation in voice quality caused by MCI. The LOTUS Thai speech corpus was incorporated into the training set to improve the model’s generalization. A preprocessing algorithm was implemented to reduce background noise and improve overall data quality before feeding data into the TDNN-HMM system for automatic word detection and language fluency score calculation. The results show that the TDNN-HMM model, in combination with data augmentation and the lattice-free maximum mutual information (LF-MMI) objective function, provides a word error rate (WER) of 30.77%. To our knowledge, this is the first study to develop an ASR system with Thai language support to automate the scoring of MoCA’s language fluency assessment.
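The word error rate reported above is conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below shows that standard definition. It assumes the input is already split into word tokens (Thai text is not space-delimited, so a word segmenter would be needed first), and the function name is hypothetical.

```python
def word_error_rate(reference_words, hypothesis_words):
    """WER = (substitutions + deletions + insertions) / reference length.

    Both arguments are lists of word tokens (assumed pre-segmented).
    """
    if not reference_words:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hypothesis_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference_words)
```

With a four-word reference and an output that substitutes one word and drops another, the function returns 0.5.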
In segment-based speech recognition systems, the quality of the segmentation step is a major factor highly affecting their accuracies. This paper proposes methods to reduce missing segments caused by boundary insertion errors in segment graphs, which, in the case of Thai, could be generated from a probabilistic segmentation with limited speech resources. Acoustic discontinuities and manners of articulation are used to verify boundaries of the segment graph. Segments are added to the graph in the case of possible falsely detected boundaries. With the proposed insertion error eliminations, the best phonetic recognition accuracy achieved shows a 13.66% error reduction.
I. INTRODUCTION
Segment-based speech recognition [1] is an approach to the automatic speech recognition problem where acoustic [...] the speech signal according to a hypothesized underlying speech unit, called a "segment", rather than from a fixed-length frame as in a more widely adopted frame-based approach, such as Hidden Markov Model (HMM)-based speech recognition. This technique has many advantages over the frame-based approach. For example, the segment-based approach makes fewer conditional independence assumptions between observations, it can be easily designed to support the use of heterogeneous feature vectors and classifiers [2], and it is easier to integrate with speech-specific knowledge such as phonetic boundaries, one of the important cues for phonetic contrasts. In English, MIT's SUMMIT [1], a segment-based speech recognition system, has shown to be successful in [...]
[...] could be implemented using various methods, including dynamic programming techniques to search the composed weighted finite state transducer between the segment graph and a pronunciation graph derived from the grammar of the recognition task.
It is obvious that the quality of the segment graph, which could be judged by how many correctly hypothesized segments reside in the graph, is a major factor that highly affects the recognition accuracy, since segmentation errors are propagated to the recognition process. In many languages, probabilistic segmentations that construct segment graphs from the results of first-pass frame-based phonetic recognition have been proven to yield good performance. For Thai, segment graphs from such highly accurate segmentation algorithms are still prone to errors. This is partially due to the lack of speech resources that can be utilized to train acoustic models in speech recognition research. With well-tuned HMMs, a phonetic recognition accuracy of only approximately 50% was achieved when clean speech utterances in the training set of the LOTUS corpus [4], the only publicly available large-vocabulary Thai speech corpus, were used to train the acoustic models and a bigram language model was also used to constrain the search.
This paper aims at improving the quality of the segment graph obtained from a typical HMM-based phonetic recognition by adjusting segment availability in the graphs so tha...
The use of Hidden Markov Models (HMMs) in pattern recognition tasks is now very common. As in other pattern recognition tasks, most Automatic Speech Recognition systems rely on HMM acoustic models. In such systems, recognition performance is significantly affected by the topologies of these models. In this paper, we propose an HMM topology estimation approach for Thai phoneme recognition tasks, whose process is divided into two stages. First, a set of suitable topologies is constructed from combinations of different objective functions and topology generation methods. Second, a Genetic Algorithm is deployed as the topology selection algorithm; it considers global fitness and, for each phoneme, selects the most suitable topology from the candidates proposed in the previous stage. As a result, the well-trained topologies yield a maximum of 4.36% error reduction over predefined left-to-right models. The estimated topologies still work well when the topology estimation is performed on speech utterances whose recording environments differ from those of the utterances being recognized.