2022
DOI: 10.3390/s22197319

Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition

Abstract: Unlike traditional models, the end-to-end (E2E) ASR model does not require linguistic resources such as a pronunciation dictionary; the system is built as a single neural network and achieves performance comparable to that of traditional methods. However, the model requires massive amounts of training data. Recently, hybrid CTC/attention ASR systems have become more popular and have achieved good performance even under low-resource conditions, but they are rarely used in Central Asian languages such a…
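The abstract refers to the hybrid CTC/attention objective, which interpolates a CTC loss on the encoder output with an attention-decoder cross-entropy loss. Below is a minimal PyTorch sketch of that joint objective; the tensor shapes, the interpolation weight, and the function name are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a joint CTC/attention objective, assuming a PyTorch model:
# L = ctc_weight * L_ctc + (1 - ctc_weight) * L_attention.
# Shapes and the 0.3 default weight are assumptions for illustration only.
import torch
import torch.nn.functional as F


def hybrid_ctc_attention_loss(ctc_log_probs, ctc_targets, input_lengths,
                              target_lengths, att_logits, att_targets,
                              ctc_weight=0.3, pad_id=-100):
    """Interpolate a CTC loss on the encoder branch with a cross-entropy
    (attention decoder) loss on the decoder branch."""
    # ctc_log_probs: (time, batch, vocab), log-softmax of the encoder projection
    ctc_loss = F.ctc_loss(ctc_log_probs, ctc_targets,
                          input_lengths, target_lengths)
    # att_logits: (batch, target_len, vocab); att_targets: (batch, target_len)
    att_loss = F.cross_entropy(att_logits.transpose(1, 2), att_targets,
                               ignore_index=pad_id)
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


if __name__ == "__main__":
    # Toy shapes: 50 frames, batch of 2, vocab of 30, target length 10.
    T, B, V, S = 50, 2, 30, 10
    log_probs = torch.randn(T, B, V).log_softmax(-1)
    targets = torch.randint(1, V, (B, S))          # 0 is reserved as the CTC blank
    loss = hybrid_ctc_attention_loss(
        log_probs, targets,
        input_lengths=torch.full((B,), T, dtype=torch.long),
        target_lengths=torch.full((B,), S, dtype=torch.long),
        att_logits=torch.randn(B, S, V),
        att_targets=targets,
    )
    print(loss.item())
```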

Cited by 9 publications (5 citation statements) | References 64 publications
“… Yu et al. (2022) combined the biLSTM layer with dimension reduction and showed that they saved up to 0.5 days of processing time on the dataset they analyzed. Ren et al. (2022) observed WER reductions of 2.96%, 7.07%, and 7.08% on the LibriSpeech, Common Voice-Turkish, and Common Voice-UZBEK datasets, respectively, by using the proposed feature extraction technique. Oruh, Viriri & Adegun (2022) achieved 99.36% accuracy on the English digit dataset with the model they proposed to address the memory bandwidth problem of the LSTM layer.…”
Section: Literature Review
confidence: 86%
“…Although hybrid CTC/attention ASR systems have gained popularity and improved significantly even in low-resource environments, they are rarely used for Central Asian languages like Turkish and Uzbek. Ren et al. [34] proposed a CNN-based feature extractor called Multi-Scale Parallel Convolution (MSPC) that uses different convolution kernel sizes to extract features at different scales and combined it with bidirectional long short-term memory (Bi-LSTM) to form an encoder structure that boosts the end-to-end model’s recognition rate and system robustness. The authors initialized the RNN language model with a fine-tuned pre-trained BERT and incorporated it into the decoding process.…”
Section: Related Work
confidence: 99%
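The statement above describes the paper's multi-scale parallel convolution (MSPC) front-end feeding a Bi-LSTM encoder. The following rough PyTorch sketch illustrates that idea; the kernel sizes, channel counts, and class name are assumptions for illustration, not the authors' exact configuration.

```python
# Hypothetical sketch of a multi-scale parallel convolution (MSPC) front-end
# followed by a Bi-LSTM encoder, as described in the quoted statement.
# Layer sizes and kernel sizes are assumed, not the paper's configuration.
import torch
import torch.nn as nn


class MSPCEncoder(nn.Module):
    def __init__(self, feat_dim=80, branch_channels=64,
                 kernel_sizes=(3, 5, 7), lstm_hidden=256, lstm_layers=2):
        super().__init__()
        # One Conv1d branch per kernel size; "same" padding keeps the time
        # resolution identical so the branch outputs can be concatenated.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(feat_dim, branch_channels, k, padding=k // 2),
                nn.BatchNorm1d(branch_channels),
                nn.ReLU(),
            )
            for k in kernel_sizes
        ])
        self.blstm = nn.LSTM(
            input_size=branch_channels * len(kernel_sizes),
            hidden_size=lstm_hidden,
            num_layers=lstm_layers,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, feats):
        # feats: (batch, time, feat_dim) acoustic features, e.g. log-Mel filterbanks
        x = feats.transpose(1, 2)                  # (batch, feat_dim, time)
        multi_scale = [branch(x) for branch in self.branches]
        x = torch.cat(multi_scale, dim=1)          # concatenate along channels
        x = x.transpose(1, 2)                      # back to (batch, time, channels)
        encoded, _ = self.blstm(x)                 # (batch, time, 2 * lstm_hidden)
        return encoded


if __name__ == "__main__":
    encoder = MSPCEncoder()
    dummy = torch.randn(4, 200, 80)   # 4 utterances, 200 frames, 80-dim features
    print(encoder(dummy).shape)       # torch.Size([4, 200, 512])
```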
“…Despite the growing popularity and advancements in hybrid CTC/attention ASR systems, particularly in low-resource languages, their application to Central Asian languages like Turkish and Uzbek remains limited. Ren et al [44] introduced a novel feature extraction method using CNNs, termed multiscale parallel convolution (MSPC). This technique utilizes convolution kernels of varying sizes to capture features at different scales, combined with a bidirectional long short-term memory (Bi-LSTM) network to boost the accuracy and stability of the end-to-end model.…”
Section: Related Work
confidence: 99%