Arabic speech recognition using end‐to‐end deep learning

Alsayadi, Hamzah A.; Abdelhamid, Abdelaziz A.; Hegazy, Islam; Fayed, Zaki T.

doi:10.1049/sil2.12057

Cited by 25 publications

(15 citation statements)

References 51 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Furthermore, CNN has new properties above DNN, such as localization, weight sharing, and pooling. In the convolution unit, the locality is employed to handle noise where is used [31,32]. Additionally, locality minimizes the network weights that must be learned.…”

Section: Theoretical Backgroundmentioning

confidence: 99%

“…Other models, such as GMMs and DNNs, find it harder to manipulate this shifting process. As a result, ASR researchers have recently employed localization in both frequency and time axes in speech signals [31,32].…”

Section: Theoretical Backgroundmentioning

confidence: 99%

“…To reduce the computational cost, LSTM presents the outputs. Therefore, in ASR, the input length is not equivalent to the output length [31].…”

Section: Theoretical Backgroundmentioning

confidence: 99%

“…The look-ahead word-based LM probabilities are determined in each recognition phase according to the decoding of the word prefixes. The prefix trees are used to transform a word-based LM into a character-based LM [17,31]. We used a completely parallelized version of the decoding technique for GPUs that was presented in Espresso for Parallelization LM.…”

Section: Language Modelingmentioning

confidence: 99%

See 3 more Smart Citations

An End-to-End Transformer-Based Automatic Speech Recognition for Qur’an Reciters

Hadwan¹,

Alsayadi²,

Al-Hagree³

2023

Computers, Materials &Amp; Continua

View full text Add to dashboard Cite

The attention-based encoder-decoder technique, known as the trans-former, is used to enhance the performance of end-to-end automatic speech recognition (ASR). This research focuses on applying ASR end-toend transformer-based models for the Arabic language, as the researchers' community pays little attention to it. The Muslims Holy Qur'an book is written using Arabic diacritized text. In this paper, an end-to-end transformer model to building a robust Qur'an vs. recognition is proposed. The acoustic model was built using the transformer-based model as deep learning by the PyTorch framework. A multi-head attention mechanism is utilized to represent the encoder and decoder in the acoustic model. A Mel filter bank is used for feature extraction. To build a language model (LM), the Recurrent Neural Network (RNN) and Long short-term memory (LSTM) were used to train an n-gram word-based LM. As a part of this research, a new dataset of Qur'an verses and their associated transcripts were collected and processed for training and evaluating the proposed model, consisting of 10 h of .wav recitations performed by 60 reciters. The experimental results showed that the proposed end-to-end transformer-based model achieved a significant low character error rate (CER) of 1.98% and a word error rate (WER) of 6.16%. We have achieved state-of-the-art end-to-end transformer-based recognition for Qur'an reciters.

show abstract

Section: Theoretical Backgroundmentioning

confidence: 99%

Section: Theoretical Backgroundmentioning

confidence: 99%

“…To reduce the computational cost, LSTM presents the outputs. Therefore, in ASR, the input length is not equivalent to the output length [31].…”

Section: Theoretical Backgroundmentioning

confidence: 99%

Section: Language Modelingmentioning

confidence: 99%

See 2 more Smart Citations

An End-to-End Transformer-Based Automatic Speech Recognition for Qur’an Reciters

Hadwan¹,

Alsayadi²,

Al-Hagree³

2023

Computers, Materials &Amp; Continua

View full text Add to dashboard Cite

show abstract

“…In the development of the automatic speech recognition system, as before, attention is paid to end-to-end methods; many studies have proven that performance and accuracy increase with an increase in the amount of data used for training. For example, in published studies, the best results in training big data were obtained by end-to-end systems based on CTC [5,6] and attention-based encoder-decoder models. In end-to-end models, all parameters are calculated by the gradient descent method, which is easily influenced by the structure of the neural network.…”

Section: Introductionmentioning

confidence: 99%

Identifying the influence of transfer learning method in developing an end-to-end automatic speech recognition system with a low data level

Mamyrbayev

Alimhan

Оралбекова

et al. 2022

EEJET

View full text Add to dashboard Cite

Ensuring the best quality and performance of modern speech technologies, today, is possible based on the widespread use of machine learning methods. The idea of this project is to study and implement an end-to-end system of automatic speech recognition using machine learning methods, as well as to develop new mathematical models and algorithms for solving the problem of automatic speech recognition for agglutinative (Turkic) languages. Many research papers have shown that deep learning methods make it easier to train automatic speech recognition systems that use an end-to-end approach. This method can also train an automatic speech recognition system directly, that is, without manual work with raw signals. Despite the good recognition quality, this model has some drawbacks. These disadvantages are based on the need for a large amount of data for training. This is a serious problem for low-data languages, especially Turkic languages such as Kazakh and Azerbaijani. To solve this problem, various methods are needed to apply. Some methods are used for end-to-end speech recognition of languages belonging to the group of languages of the same family (agglutinative languages). Method for low-resource languages is transfer learning, and for large resources – multi-task learning. To increase efficiency and quickly solve the problem associated with a limited resource, transfer learning was used for the end-to-end model. The transfer learning method helped to fit a model trained on the Kazakh dataset to the Azerbaijani dataset. Thereby, two language corpora were trained simultaneously. Conducted experiments with two corpora show that transfer learning can reduce the symbol error rate, phoneme error rate (PER), by 14.23 % compared to baseline models (DNN+HMM, WaveNet, and CNC+LM). Therefore, the realized model with the transfer method can be used to recognize other low-resource languages.

show abstract