2022
DOI: 10.3390/s22197319

Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition

Abstract: Unlike traditional models, the end-to-end (E2E) ASR model does not require linguistic resources such as a pronunciation dictionary; the system is built as a single neural network and achieves performance comparable to that of traditional methods. However, the model requires massive amounts of training data. Recently, hybrid CTC/attention ASR systems have become more popular and have achieved good performance even under low-resource conditions, but they are rarely used in Central Asian languages such a…
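The abstract refers to the hybrid CTC/attention objective, which interpolates a CTC loss on the encoder output with an attention-decoder cross-entropy loss. Below is a minimal PyTorch sketch of that joint objective; the tensor shapes, the interpolation weight, and the function name are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a joint CTC/attention objective, assuming a PyTorch model:
# L = ctc_weight * L_ctc + (1 - ctc_weight) * L_attention.
# Shapes and the 0.3 default weight are assumptions for illustration only.
import torch
import torch.nn.functional as F


def hybrid_ctc_attention_loss(ctc_log_probs, ctc_targets, input_lengths,
                              target_lengths, att_logits, att_targets,
                              ctc_weight=0.3, pad_id=-100):
    """Interpolate a CTC loss on the encoder branch with a cross-entropy
    (attention decoder) loss on the decoder branch."""
    # ctc_log_probs: (time, batch, vocab), log-softmax of the encoder projection
    ctc_loss = F.ctc_loss(ctc_log_probs, ctc_targets,
                          input_lengths, target_lengths)
    # att_logits: (batch, target_len, vocab); att_targets: (batch, target_len)
    att_loss = F.cross_entropy(att_logits.transpose(1, 2), att_targets,
                               ignore_index=pad_id)
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


if __name__ == "__main__":
    # Toy shapes: 50 frames, batch of 2, vocab of 30, target length 10.
    T, B, V, S = 50, 2, 30, 10
    log_probs = torch.randn(T, B, V).log_softmax(-1)
    targets = torch.randint(1, V, (B, S))          # 0 is reserved as the CTC blank
    loss = hybrid_ctc_attention_loss(
        log_probs, targets,
        input_lengths=torch.full((B,), T, dtype=torch.long),
        target_lengths=torch.full((B,), S, dtype=torch.long),
        att_logits=torch.randn(B, S, V),
        att_targets=targets,
    )
    print(loss.item())
```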

Cited by 9 publications (5 citation statements) | References 64 publications
“… Yu et al. (2022) combined the biLSTM layer with dimension reduction and showed that they saved up to 0.5 days of processing time on the dataset they analyzed. Ren et al. (2022) observed WER reductions of 2.96%, 7.07%, and 7.08% on the LibriSpeech, Common Voice-Turkish, and Common Voice-UZBEK datasets, respectively, by using the proposed feature extraction technique. Oruh, Viriri & Adegun (2022) achieved 99.36% accuracy on the English digit dataset with the model they proposed to address the memory bandwidth problem of the LSTM layer.…”
Section: Literature Review
confidence: 86%
“…Although hybrid CTC/attention ASR systems have gained popularity and improved significantly even in low-resource environments, they are rarely used for Central Asian languages like Turkish and Uzbek. Ren et al. [34] proposed a CNN-based feature extractor called Multi-Scale Parallel Convolution (MSPC) that uses different convolution kernel sizes to extract features at different scales and combined it with bidirectional long short-term memory (Bi-LSTM) to form an encoder structure that boosts the end-to-end model’s recognition rate and system robustness. The authors initialized the RNN language model with a fine-tuned pre-trained BERT and incorporated it into the decoding process.…”
Section: Related Work
confidence: 99%
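The statement above describes the paper's multi-scale parallel convolution (MSPC) front-end feeding a Bi-LSTM encoder. The following rough PyTorch sketch illustrates that idea; the kernel sizes, channel counts, and class name are assumptions for illustration, not the authors' exact configuration.

```python
# Hypothetical sketch of a multi-scale parallel convolution (MSPC) front-end
# followed by a Bi-LSTM encoder, as described in the quoted statement.
# Layer sizes and kernel sizes are assumed, not the paper's configuration.
import torch
import torch.nn as nn


class MSPCEncoder(nn.Module):
    def __init__(self, feat_dim=80, branch_channels=64,
                 kernel_sizes=(3, 5, 7), lstm_hidden=256, lstm_layers=2):
        super().__init__()
        # One Conv1d branch per kernel size; "same" padding keeps the time
        # resolution identical so the branch outputs can be concatenated.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(feat_dim, branch_channels, k, padding=k // 2),
                nn.BatchNorm1d(branch_channels),
                nn.ReLU(),
            )
            for k in kernel_sizes
        ])
        self.blstm = nn.LSTM(
            input_size=branch_channels * len(kernel_sizes),
            hidden_size=lstm_hidden,
            num_layers=lstm_layers,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, feats):
        # feats: (batch, time, feat_dim) acoustic features, e.g. log-Mel filterbanks
        x = feats.transpose(1, 2)                  # (batch, feat_dim, time)
        multi_scale = [branch(x) for branch in self.branches]
        x = torch.cat(multi_scale, dim=1)          # concatenate along channels
        x = x.transpose(1, 2)                      # back to (batch, time, channels)
        encoded, _ = self.blstm(x)                 # (batch, time, 2 * lstm_hidden)
        return encoded


if __name__ == "__main__":
    encoder = MSPCEncoder()
    dummy = torch.randn(4, 200, 80)   # 4 utterances, 200 frames, 80-dim features
    print(encoder(dummy).shape)       # torch.Size([4, 200, 512])
```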
“…Despite the growing popularity and advancements in hybrid CTC/attention ASR systems, particularly in low-resource languages, their application to Central Asian languages like Turkish and Uzbek remains limited. Ren et al [44] introduced a novel feature extraction method using CNNs, termed multiscale parallel convolution (MSPC). This technique utilizes convolution kernels of varying sizes to capture features at different scales, combined with a bidirectional long short-term memory (Bi-LSTM) network to boost the accuracy and stability of the end-to-end model.…”
Section: Related Work
confidence: 99%