2021
DOI: 10.1016/j.asoc.2021.107141
CASA-based speaker identification using cascaded GMM-CNN classifier in noisy and emotional talking conditions

Cited by 55 publications (20 citation statements)
References 40 publications
“…In the proposed work, the features extracted for an optimal representation of the speech signal are the Mel-frequency cepstral coefficients (MFCC) [2]. MFCC is a fundamental feature used in speaker and emotion recognition because of the advanced representation of human auditory perception it provides [31][32][33]. MFCC is based on human hearing perception, which resolves frequencies roughly linearly below 1000 Hz and logarithmically above it.…”
Section: Feature Extraction (mentioning)
confidence: 99%
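The mel-scale warping that underlies MFCC can be sketched in a few lines — a minimal illustration of the standard HTK-style conversion formula, independent of the cited work:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Standard mel-scale conversion: roughly linear below ~1000 Hz
    and logarithmic above, mirroring human pitch perception."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse conversion back to Hertz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps in Hz correspond to ever-smaller steps in mel at high
# frequency, which is why MFCC filterbanks widen toward high frequencies.
print(round(hz_to_mel(1000.0), 1))             # -> 1000.0 (1 kHz reference)
print(round(mel_to_hz(hz_to_mel(4000.0)), 1))  # -> 4000.0 (round trip)
```

In a full MFCC pipeline this warping is applied to a bank of triangular filters over the power spectrum before the log and DCT steps.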
“…The major motives for employing dimensionality reduction in machine learning are to enhance both prediction performance and learning efficiency, to deliver faster predictions that demand less of the original data, to reduce the complexity and runtime of learning, and to allow a better understanding of the underlying process. This is especially important when the input vector is large, as in speech-processing problems [9], [10]. Lower data dimensions lead to less computing time and complexity, and much less storage.…”
Section: Figure 1 Dimensionality Reduction Taxonomy (mentioning)
confidence: 99%
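As a concrete illustration of this point (not code from the cited work), a minimal PCA via SVD, assuming NumPy is available — a 50-dimensional feature vector is projected onto its 10 leading principal components:

```python
import numpy as np

def pca_reduce(X: np.ndarray, n_components: int) -> np.ndarray:
    """Project data onto its top principal components via SVD.
    Rows of X are samples, columns are features."""
    X_centered = X - X.mean(axis=0)          # PCA requires centered data
    # Economy SVD: rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T  # scores in the reduced space

rng = np.random.default_rng(0)
# 200 samples of a hypothetical 50-dimensional speech feature vector
X = rng.normal(size=(200, 50))
Z = pca_reduce(X, 10)
print(Z.shape)  # 5x fewer values to store and learn from
```

Downstream classifiers then train on `Z` instead of `X`, which is where the speed and storage gains the excerpt describes come from.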
“…The t-SNE transforms high-dimensional Euclidean distances into conditional probabilities expressing pairwise data similarity, following Stochastic Neighbor Embedding (SNE) [21]. The conditional probability $p_{a|b}$, defined in Equation (10), quantifies the similarity of data point $x_a$ to data point $x_b$ [20]:

$$p_{a|b} = \frac{\exp\!\left(-\|x_a - x_b\|^2 / 2\sigma_b^2\right)}{\sum_{k \neq b} \exp\!\left(-\|x_k - x_b\|^2 / 2\sigma_b^2\right)} \tag{10}$$

Equation (10) measures the distance between two data points $x_a$ and $x_b$ using a Gaussian centered on $x_b$ with variance $\sigma_b^2$, which differs for each data point and is chosen so that points in dense regions receive a smaller variance than points in sparse regions [20]. Then, instead of the Gaussian, a Student t-distribution with one degree of freedom (close to the Cauchy distribution) is used to obtain the second set of probabilities $q_{a|b}$ in the low-dimensional space [22].…”
Section: t-Distributed Stochastic Neighbor Embedding (t-SNE) (mentioning)
confidence: 99%
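The conditional probability above can be sketched directly in NumPy. Note the simplification: a single fixed variance is assumed here, whereas t-SNE actually tunes each $\sigma_b$ per point to match a target perplexity:

```python
import numpy as np

def conditional_probabilities(X: np.ndarray, b: int, sigma: float) -> np.ndarray:
    """p_{a|b} for all a: Gaussian similarities centered on x_b,
    normalized over all points a != b (Eq. 10, fixed sigma)."""
    d2 = np.sum((X - X[b]) ** 2, axis=1)   # squared distances to x_b
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    w[b] = 0.0                              # a point is not its own neighbor
    return w / w.sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
p = conditional_probabilities(X, b=0, sigma=1.0)
print(round(p.sum(), 6))  # -> 1.0, a valid distribution over a != b
```

The low-dimensional probabilities $q_{a|b}$ are computed analogously but with the heavier-tailed Student t kernel, which is what mitigates the crowding problem.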
“…Most works test signal corruption by noise only additively [12][13][14][15]. However, simply adding noise does not represent real environments, since noise is also affected by room reverberation.…”
Section: Introduction (unclassified)
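To make concrete what purely additive corruption means (a generic sketch, not code from the cited works): noise is simply scaled to a target SNR and summed with the clean signal, with no room impulse response or reverberation applied.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additively mix noise into a clean signal at a target SNR in dB.
    No reverberation is modeled, which is why purely additive tests
    understate real-room degradation."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale noise so that 10*log10(p_clean / p_scaled_noise) == snr_db
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(2)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)  # 1 s, 440 Hz tone
noise = rng.normal(size=16000)
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

Reproducing a real room would additionally require convolving both signals with measured room impulse responses, which is the gap the excerpt points out.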
“…Currently, works that address real situations use databases that already provide data recorded under real-environment conditions [12][13][14][15], such as SITW [16] and NIST 2010 retransmitted [17]. As a result, models achieve error rates below 10% in their experiments, yet can perform considerably worse when voice-based systems are actually deployed [18].…”
Section: Introduction (unclassified)