Noise-Robust Speech Recognition System based on Multimodal Audio-Visual Approach Using Different Deep Learning Classification Techniques

Elmaghraby, Eslam Eid; Gody, Amr M.; Farouk, Mohammed

doi:10.21608/ejle.2020.22022.1002

Cited by 6 publications

(2 citation statements)

References 32 publications

(39 reference statements)


“…PCA uses statistical tools to identify noise and redundancy in the dataset [30]. It keeps the necessary parts that have more variation of the data and removes the unnecessary parts with fewer variations, therefore speeding up the training and testing time of the machine learning algorithm.…”

Section: Model Setup (mentioning)

confidence: 99%

Arabic Automatic Speech Recognition Based on Emotion Detection

Abdelmaksoud

2021

The Egyptian Journal of Language Engineering


This work presents a novel emotion recognition approach via automatic speech recognition (ASR) using a deep feed-forward neural network (DFFNN) for Arabic speech. We present results for the recognition of three emotions: happy, angry, and surprised. The Arabic natural audio dataset (ANAD) is used. Twenty-five low-level descriptors (LLDs) are extracted from the audio signals. Different combinations of the extracted features are examined, as is the effect of using principal component analysis (PCA) for dimensionality reduction. For the classification stage, DFFNN is used. The problem of imbalanced samples in the dataset is managed using the borderline synthetic minority over-sampling technique (B-SMOTE). The results show that the best accuracy, 98.56 %, is obtained when applying PCA to the extracted features. The accuracy is 98.33 % when using the combination of all the extracted features, which is not much different from the accuracy with PCA. It is followed by the accuracy of using MFCC and LSF, which is 97.79 %. The accuracy is 95.63 % when using LSF features, which shows that they are dominant features. The obtained results show an improvement compared to previous studies.
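
The variance-based reduction that the excerpt and abstract attribute to PCA can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy data, the function names, and the use of power iteration to find only the first principal component are all assumptions made for illustration, using the Python standard library only.

```python
import math

def center_columns(rows):
    """Subtract the per-column mean so each feature has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance_matrix(centered):
    """Sample covariance (divides by n - 1) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(cov, iters=200):
    """Power iteration: returns (unit direction, variance along it).

    Assumes cov is non-zero and the start vector is not orthogonal
    to the leading eigenvector (true for the toy data below).
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    return v, eigenvalue

def project(centered, direction):
    """Reduce each row to a single score along the given direction."""
    return [sum(r[j] * direction[j] for j in range(len(direction)))
            for r in centered]

# Hypothetical toy data: two features, the second roughly twice the first.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
noise = [0.1, -0.1, 0.0, 0.1, -0.1]
rows = [[x, 2.0 * x + e] for x, e in zip(xs, noise)]

centered = center_columns(rows)
cov = covariance_matrix(centered)
direction, variance = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
scores = project(centered, direction)
print("explained variance ratio:", variance / total_variance)
```

Because the two toy features are nearly redundant, one component carries almost all of the variance, so the second dimension can be dropped with little loss. That is the effect the excerpt describes as removing the parts with less variation.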


“…Alternatively, it is known that the CNN model is a deep learning algorithm that can perform complex tasks with images, videos, texts, and sounds that are inspired by the human visual system [3]. CNN's achieved great success in image recognition [4], and recently they are widely adopted in ASR systems [5]-[9]. Most leading technology companies like Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, have initiated research and development projects [10]-[13] which employs CNN for image recognition products and services.…”

Section: Introduction (mentioning)

confidence: 99%

Convolutional Neural Network for Arabic Speech Recognition

Abdelmaksoud, Hassen, Hassan, et al.

2021

The Egyptian Journal of Language Engineering


This work focuses on single-word Arabic automatic speech recognition (AASR). Two techniques are used during the feature extraction phase: log Mel-frequency spectral coefficients (MFSC) and Gammatone-frequency cepstral coefficients (GFCC), with their first- and second-order derivatives. The convolutional neural network (CNN) is mainly used to perform feature learning and classification. CNNs have achieved performance enhancements in automatic speech recognition (ASR). Local connectivity, weight sharing, and pooling are the crucial properties of CNNs that have the potential to improve ASR. We tested the CNN model using an Arabic speech corpus of isolated words. The corpus is synthetically augmented by applying different transformations such as changing the pitch, the speed, and the dynamic range, adding noise, and shifting forward and backward in time. The maximum accuracy, obtained when using GFCC with CNN, is 99.77 %. The results of this work are compared to previous reports and indicate that CNN achieved better performance in AASR.
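
The three CNN properties named in the abstract (local connectivity, weight sharing, and pooling) can be shown in a few lines. The sketch below is a simplification under stated assumptions: it works on a one-dimensional sequence instead of the two-dimensional MFSC or GFCC feature maps, and the kernel values and function names are hypothetical, not taken from the cited paper.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation.

    Local connectivity: each output sees only len(kernel) neighbouring inputs.
    Weight sharing: the same kernel is reused at every position.
    """
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    """Element-wise rectifier, a common non-linearity after convolution."""
    return [max(0.0, v) for v in values]

def max_pool(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

# Hypothetical feature sequence and an edge-detecting kernel.
frame_energies = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
edge_kernel = [-1.0, 0.0, 1.0]

feature_map = relu(conv1d(frame_energies, edge_kernel))
pooled = max_pool(feature_map, 2)
print(feature_map, pooled)
```

Pooling keeps the strongest response in each window, which makes the detected pattern less sensitive to small shifts along the sequence. That tolerance is one reason these properties are considered useful for speech features.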
