End-to-End Model Based on RNN-T for Kazakh Speech Recognition

Mamyrbayev, Orken; Oralbekova, Dina; Kydyrbekova, Aizat; Turdalykyzy, Tolganay; Bekarystankyzy, Akbayan

doi:10.1109/iccci51764.2021.9486811

Cited by 11 publications

(5 citation statements)

References 15 publications

“…A number of models [38, 39, 40, 41, 42, 43, 44] have been created for the speech recognition of the Kazakh language. The complexity with regard to Kazakh, its distinctive features, the scarcity of emotional speech datasets, and other factors make it difficult to develop a model for emotional speech detection in this language.…”

Section: Related Work (mentioning)

confidence: 99%

Continuous Sign Language Recognition and Its Translation into Intonation-Colored Speech

Amangeldy, Ukenova, Bekmanova, et al. 2023. Sensors

This article is devoted to solving the problem of converting sign language into a consistent text with intonation markup for subsequent voice synthesis of sign phrases by speech with intonation. The paper proposes an improved method of continuous recognition of sign language, the results of which are transmitted to a natural language processor based on analyzers of morphology, syntax, and semantics of the Kazakh language, including morphological inflection and the construction of an intonation model of simple sentences. This approach has significant practical and social significance, as it can lead to the development of technologies that will help people with disabilities to communicate and improve their quality of life. As a result of the cross-validation of the model, we obtained an average test accuracy of 0.97 and an average val_accuracy of 0.90 for model evaluation. We also identified 20 sentence structures of the Kazakh language with their intonational model.

“…In the field of Kazakh speech recognition, Mamyrbayev et al [17] investigated the implementation of an end-to-end model based on RNN-T. They focused on streaming speech recognition, in which the audio stream is directly converted to text in real time.…”

Section: Related Work (mentioning)

confidence: 99%

The Development of a Kazakh Speech Recognition Model Using a Convolutional Neural Network with Fixed Character Level Filters

Kadyrbek, Mansurova, Shomanov, et al. 2023. BDCC

This study is devoted to the transcription of human speech in the Kazakh language in dynamically changing conditions. It discusses key aspects related to the phonetic structure of the Kazakh language, technical considerations in collecting the transcribed audio corpus, and the use of deep neural networks for speech modeling. A high-quality decoded audio corpus was collected, containing 554 h of data, giving an idea of the frequencies of letters and syllables, as well as demographic parameters such as the gender, age, and region of residence of native speakers. The corpus contains a universal vocabulary and serves as a valuable resource for the development of modules related to speech. Machine learning experiments were conducted using the DeepSpeech2 model, which includes a sequence-to-sequence architecture with an encoder, decoder, and attention mechanism. To increase the reliability of the model, filters initialized with symbol-level embeddings were introduced to reduce the dependence on accurate positioning on object maps. The training process included simultaneous preparation of convolutional filters for spectrograms and symbolic objects. The proposed approach, using a combination of supervised and unsupervised learning methods, resulted in a 66.7% reduction in the weight of the model while maintaining relative accuracy. The evaluation on the test sample showed a 7.6% lower character error rate (CER) compared to existing models, demonstrating its most modern characteristics. The proposed architecture provides deployment on platforms with limited resources. Overall, this study presents a high-quality audio corpus, an improved speech recognition model, and promising results applicable to speech-related applications and languages beyond Kazakh.

“…Mamyrbayev et al [15] introduce stream speech recognition using the RNN-T model in their study. The architecture of the model is constructed using neural networks such as LSTM and BLSTM, and it was trained using over 300 h of prepared (reading) and spontaneous speech data.…”

Section: Xlsr-53 (mentioning)

confidence: 99%

Kazakh Speech Recognition: Wav2vec2.0 vs. Whisper

Kozhirbayev 2023. JAIT

In recent years, the progress made in neural models trained on extensive multilingual text or speech data has shown great potential for improving the status of underresourced languages. This paper focuses on experimenting with three state-of-the-art speech recognition models, namely Facebook's Wav2Vec2.0 and Wav2Vec2-XLS-R, OpenAI's Whisper, on the Kazakh language. The objective of this research is to investigate the effectiveness of these models in transcribing Kazakh speech and to compare their performance with existing supervised Automatic Speech Recognition (ASR) systems. The study also aims to explore the possibility of using data from other languages for pre-training and to test whether fine-tuning the target language data can improve model performance. Thus, this work can provide insights into the effectiveness of using pretrained multilingual models in underresourced language settings. The wav2vec2.0 model achieved a Character Error Rate (CER) of 2.8 and a Word Error Rate (WER) of 8.7 on the test set, which closely matches the best result achieved by the end-to-end Transformer model. The large whisper model achieves a CER of approximately 4 on the test set. The results of this study can contribute to the development of robust and efficient ASR systems for the Kazakh language, benefiting various applications, including speech-to-text translation, voice assistants, and speech-based communication tools.
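
The CER and WER figures quoted in this abstract are edit-distance ratios, usually reported as percentages. The abstract does not say how they were computed, so the following is only a minimal sketch of the standard definition: the Levenshtein distance between reference and hypothesis, divided by the reference length. The function names (`edit_distance`, `error_rate`, `wer`, `cer`) are illustrative, and counting spaces as characters in `cer` is an assumption; some toolkits strip or normalize whitespace first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance as a percentage of the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word error rate: units are whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: units are characters (spaces included; an assumption)."""
    return error_rate(list(reference), list(hypothesis))
```

Because insertions count as errors while the denominator is the reference length, both rates can exceed 100 when the hypothesis is much longer than the reference.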
