Over the past decades, a tremendous amount of research has been done on the use of machine learning for speech processing applications, especially speech recognition. In the past few years, however, research has focused on utilizing deep learning for speech-related applications. This new area of machine learning has yielded far better results than earlier approaches in a variety of applications, including speech, and has thus become a very attractive area of research. This paper provides a thorough examination of the studies on speech applications conducted since 2006, when deep learning first arose as a new area of machine learning. The review includes a thorough statistical analysis based on specific information extracted from 174 papers published between 2006 and 2018. The results shed light on the trends of research in this area and bring new research topics into focus.
In this paper, Suprasegmental Hidden Markov Models (SPHMMs) are used to enhance the recognition performance of text-dependent speaker identification in shouted talking environments. Two speech databases are used: our collected database and the Speech Under Simulated and Actual Stress (SUSAS) database. Our results show that SPHMMs significantly enhance speaker identification performance compared to Second-Order Circular Hidden Markov Models (CHMM2s) in shouted talking environments. On our collected database, speaker identification performance in this environment is 68% based on CHMM2s and 75% based on SPHMMs; on the SUSAS database, it is 71% and 79%, respectively.
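The abstract does not give the SPHMM scoring scheme, but a common formulation combines the acoustic (segmental) HMM log-likelihood with a suprasegmental (prosodic) model log-likelihood through a weighting factor. The following is a minimal sketch under that assumption; the function names and the weight alpha are illustrative, not the paper's published implementation.

```python
import numpy as np

def sphmm_score(log_p_acoustic: float, log_p_suprasegmental: float,
                alpha: float = 0.5) -> float:
    """Weighted fusion of acoustic HMM and suprasegmental model
    log-likelihoods for one candidate speaker model (assumed scheme)."""
    return (1.0 - alpha) * log_p_acoustic + alpha * log_p_suprasegmental

def identify_speaker(acoustic_scores, supra_scores, alpha=0.5):
    """Pick the speaker whose fused score is highest.

    acoustic_scores, supra_scores: per-speaker log-likelihoods from the
    segmental HMMs and suprasegmental models (hypothetical inputs)."""
    fused = [sphmm_score(a, s, alpha)
             for a, s in zip(acoustic_scores, supra_scores)]
    return int(np.argmax(fused))
```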
It is known that the performance of speaker identification systems is high under the neutral talking condition; however, it deteriorates under the shouted talking condition. In this paper, second-order circular hidden Markov models (CHMM2s) are proposed and implemented to enhance the performance of isolated-word text-dependent speaker identification systems under the shouted talking condition. Our results show that CHMM2s significantly improve speaker identification performance under this condition compared to first-order left-to-right hidden Markov models (LTRHMM1s), second-order left-to-right hidden Markov models (LTRHMM2s), and first-order circular hidden Markov models (CHMM1s). Under the shouted talking condition, the average speaker identification performance is 23% based on LTRHMM1s, 59% based on LTRHMM2s, and 60% based on CHMM1s, whereas it reaches 72% based on CHMM2s.
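The abstract contrasts left-to-right and circular HMM topologies. A minimal sketch of the difference, for first-order transition matrices, is given below; the self-loop probability of 0.6 is an assumption for illustration, as the paper's transition values are not in the abstract.

```python
import numpy as np

def left_to_right_transitions(n_states: int, stay: float = 0.6) -> np.ndarray:
    """Left-to-right topology: each state loops or advances to the next;
    the final state is absorbing."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i], A[i, i + 1] = stay, 1.0 - stay
    A[-1, -1] = 1.0
    return A

def circular_transitions(n_states: int, stay: float = 0.6) -> np.ndarray:
    """Circular topology: identical, except the last state wraps back to
    the first, closing the ring."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] = stay
        A[i, (i + 1) % n_states] = 1.0 - stay
    return A
```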
This paper aims at recognizing emotions with a text-independent, speaker-independent emotion recognition system based on a novel classifier: a hybrid cascade of a Gaussian mixture model and a deep neural network (GMM-DNN). This hybrid classifier has been assessed for emotion recognition on the Emirati speech database (an Arabic United Arab Emirates database) with six different emotions. The cascaded GMM-DNN classifier has been contrasted with support vector machine (SVM) and multilayer perceptron (MLP) classifiers, achieving 83.97% accuracy compared to 80.33% for SVMs and 69.78% for MLP. These results demonstrate that the hybrid classifier gives significantly higher emotion recognition accuracy than the SVM and MLP classifiers. Our GMM-DNN model yields results similar to those obtained by human judges in a subjective assessment context. The classifier's performance has also been tested on two distinct emotional databases and under normal and noisy talking conditions; the dominant signal mask provided by the hybrid classifier offers better system performance in the presence of noisy signals.
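One plausible reading of a cascaded GMM-DNN is that per-class GMM log-likelihoods serve as input features to a neural classifier. The sketch below illustrates that idea with scikit-learn stand-ins; the paper's actual front end (e.g., its acoustic features) and network architecture are not reproduced here, and all names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier

def gmm_loglik_features(gmms, utterances):
    """Score each utterance (frames x dims) against every class GMM and
    stack the average frame log-likelihoods into one feature vector."""
    return np.array([[g.score(u) for g in gmms] for u in utterances])

def train_gmm_dnn(train_utts, train_labels, n_classes, n_mix=8):
    # Stage 1: fit one GMM per emotion class on that class's frames.
    gmms = []
    for c in range(n_classes):
        frames = np.vstack([u for u, y in zip(train_utts, train_labels) if y == c])
        gmms.append(GaussianMixture(n_components=n_mix).fit(frames))
    # Stage 2: train a small neural network on the GMM score vectors.
    X = gmm_loglik_features(gmms, train_utts)
    dnn = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, train_labels)
    return gmms, dnn

def predict(gmms, dnn, utts):
    return dnn.predict(gmm_loglik_features(gmms, utts))
```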
Speaker identification systems perform well under the neutral talking condition; however, they suffer sharp degradation under the shouted talking condition. In this paper, second-order hidden Markov models (HMM2s) are used to improve the recognition performance of isolated-word text-dependent speaker identification systems under the shouted talking condition. Our results show that HMM2s significantly improve speaker identification performance compared to first-order hidden Markov models (HMM1s): the average speaker identification performance under the shouted talking condition is 23% based on HMM1s, compared to 59% based on HMM2s.
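In a second-order HMM, the transition probability depends on the two previous states rather than one. A standard way to work with such a model is to expand it into an equivalent first-order HMM over state pairs; the sketch below shows this generic textbook construction, not the paper's specific implementation.

```python
import numpy as np

def expand_second_order(A2: np.ndarray) -> np.ndarray:
    """Reduce a second-order transition tensor to a first-order matrix.

    A2[i, j, k] = P(s_t = k | s_{t-1} = j, s_{t-2} = i), shape (N, N, N).
    Returns an (N*N, N*N) matrix over composite states (i, j) -> (j, k)."""
    N = A2.shape[0]
    A1 = np.zeros((N * N, N * N))
    for i in range(N):
        for j in range(N):
            for k in range(N):
                A1[i * N + j, j * N + k] = A2[i, j, k]
    return A1
```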
People usually talk neutrally in environments free of abnormal talking conditions such as stress and emotion. Emotional conditions that can affect a person's talking tone include happiness, anger, and sadness, and such emotions are directly affected by the speaker's health status. In neutral talking environments speakers can be verified easily, whereas in emotional talking environments they cannot; consequently, speaker verification systems do not perform as well in emotional talking environments as they do in neutral ones. In this work, a two-stage approach has been employed and evaluated to improve speaker verification performance in emotional talking environments. This approach exploits the speaker's emotion cues (a text-independent, emotion-dependent speaker verification problem) using both hidden Markov models (HMMs) and suprasegmental HMMs as classifiers. It is composed of two cascaded stages that combine and integrate an emotion recognizer and a speaker recognizer into one recognizer. The architecture has been tested on two separate emotional speech databases: our collected database and the Emotional Prosody Speech and Transcripts database. The results show that the proposed approach gives promising results, with a significant improvement over previous studies and over other approaches such as the emotion-independent speaker verification approach and the emotion-dependent speaker verification approach based completely on HMMs.
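The cascade described above can be sketched as follows: stage one recognizes the utterance's emotion, and stage two verifies the claimed identity against that speaker's emotion-dependent model. The classifier objects here are placeholders exposing a score()/predict() interface (hypothetical); the paper realizes them with HMMs and suprasegmental HMMs, and the log-likelihood-ratio threshold decision is an assumed, conventional verification rule.

```python
def verify(utterance, claimed_id, emotion_recognizer, speaker_models, threshold):
    """Two-stage speaker verification sketch.

    Stage 1: recognize the utterance's emotion.
    Stage 2: score the claim against the claimed speaker's model trained for
    that emotion, normalized by an emotion-matched background model."""
    emotion = emotion_recognizer.predict(utterance)      # stage 1
    target = speaker_models[(claimed_id, emotion)]       # stage 2
    background = speaker_models[("background", emotion)]
    llr = target.score(utterance) - background.score(utterance)
    return llr >= threshold
```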