Syllable-based automatic speech recognition (ASR) systems commonly perform better than phoneme-based ones. This paper focuses on developing an Indonesian monosyllable-based ASR (MSASR) system using an ASR engine called SPRAAK and comparing it to a phoneme-based one. The Mozilla DeepSpeech-based end-to-end ASR (MDS-E2EASR), a state-of-the-art character-based model (similar to a phoneme-based model), is also investigated to confirm the result. In addition, a novel Kaituoxu SpeechTransformer (KST) E2EASR is examined. Testing on an Indonesian speech corpus of 5,439 words shows that the proposed MSASR produces much higher word accuracy (76.57%) than the monophone-based one (63.36%). Its performance is comparable to the character-based MDS-E2EASR (76.90%) and the character-based KST-E2EASR (78.00%). In the future, this monosyllable-based ASR could be extended to a bisyllable-based one to achieve higher word accuracy. However, the extensive set of bisyllable acoustic models would have to be handled using an advanced method.
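The word accuracy figures above are not defined in the abstract. A minimal sketch follows, assuming the common definition (N - S - D - I) / N, where N is the number of reference words and S, D, I are the substitutions, deletions, and insertions of a Levenshtein alignment. The function names and the example sentence are hypothetical, and the paper's actual scoring tool may differ.

```python
def edit_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # Each cell holds (total_cost, substitutions, deletions, insertions).
    dp = [[None] * cols for _ in range(rows)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        dp[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, cols):
        dp[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            c, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (c, s, d, n)          # match, no extra cost
            else:
                best = (c + 1, s + 1, d, n)  # substitution
            c, s, d, n = dp[i - 1][j]
            best = min(best, (c + 1, s, d + 1, n))  # deletion
            c, s, d, n = dp[i][j - 1]
            best = min(best, (c + 1, s, d, n + 1))  # insertion
            dp[i][j] = best
    return dp[-1][-1][1:]


def word_accuracy(reference, hypothesis):
    """Word accuracy in percent, assuming (N - S - D - I) / N."""
    ref, hyp = reference.split(), hypothesis.split()
    s, d, i = edit_counts(ref, hyp)
    return 100.0 * (len(ref) - s - d - i) / len(ref)


# Hypothetical example: one of four reference words is missed.
print(word_accuracy("saya makan nasi goreng", "saya makan nasi"))
```

Under this definition, insertions are penalized as well, so word accuracy can fall below zero when the hypothesis is much longer than the reference.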
Abstract. This paper discusses the use of the short-term energy contour of speech, smoothed by a fuzzy-based method, to automatically segment speech into syllabic units. Two additional procedures, local normalization and postprocessing, are proposed to adapt the method to the Indonesian language. Testing on 220 Indonesian utterances showed that local normalization significantly improved the performance of the fuzzy-based smoothing. In the postprocessing procedure, splitting and assimilation work in different ways. Splitting missed short syllables sharply reduced deletion but slightly increased insertion. In contrast, assimilating a single consonant segment into the expected previous or next segment slightly reduced insertion but increased deletion. Splitting alone gave higher accuracy than either assimilation alone or the combined splitting-assimilation procedure, since in many cases assimilation keeps unexpected insertions and overmerges expected segments.
Keywords: assimilation; fuzzy-based smoothing; Indonesian language; local normalization; short-term energy contour; splitting; syllable segmentation.
Introduction
Information on syllabic units can be used to improve the performance of flat start-based automatic speech recognition (ASR) [1]-[11]. In 2010, Janakiraman et al. [11] reported that incorporating information on syllable boundaries into English ASR significantly reduced both computational complexity and word error rate (WER) compared to flat start ASR. The WER was reduced from 13% to 4.4% on the TIMIT database and from 36% to 21.2% on the NTIMIT database.
Every language has unique characteristics. For example, English and Indonesian have different syllable patterns. A study of telephone conversations and the Switchboard corpus has shown that 80% of English words are monosyllabic and that 85% of them have simple structures (V, VC, CV, CVC)
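The short-term energy contour mentioned in the abstract above can be illustrated with a minimal sketch. It assumes fixed-length frames and uses a plain moving average as a stand-in for smoothing. This is not the paper's fuzzy-based method, and it omits local normalization and postprocessing. All function names, frame sizes, and the synthetic signal are hypothetical.

```python
def short_term_energy(samples, frame_len, hop):
    """Sum of squared samples in each frame of frame_len samples, stepped by hop."""
    return [
        sum(x * x for x in samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def moving_average(values, width):
    """Centered moving average (a simple stand-in, NOT fuzzy-based smoothing)."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def local_minima(values):
    """Indices of frames lower than the previous frame and not higher than the next."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] <= values[i + 1]
    ]


# Synthetic signal: two loud bursts separated by a short silence.
signal = [1.0] * 40 + [0.0] * 20 + [1.0] * 40
energy = short_term_energy(signal, frame_len=10, hop=10)
contour = moving_average(energy, width=3)
candidate_boundaries = local_minima(contour)  # frame indices near the energy dip
print(energy, candidate_boundaries)
```

Minima of the smoothed contour serve here only as candidate syllable boundaries. Real speech would need a sampling-rate-dependent frame length and a more careful boundary decision.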