Abstract: The goal of this study is to investigate advanced signal processing approaches [single frequency filtering (SFF) and zero-time windowing (ZTW)] with modern deep neural networks (DNNs) [convolutional neural networks (CNNs), temporal convolutional neural networks (TCNs), time-delay neural networks (TDNNs), and emphasized channel attention, propagation and aggregation in TDNN (ECAPA-TDNN)] for dialect classification of the major dialects of English. Previous studies indicated that SFF and ZTW methods provide higher spectro-…
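The abstract above contrasts SFF with conventional spectral analysis. As a rough illustration of the idea behind single frequency filtering, the sketch below (a minimal numpy version, not the authors' implementation; the sampling rate, pole radius, and test frequencies are illustrative assumptions) heterodynes the component of interest to the pole frequency of a single-pole filter whose pole sits near the unit circle, so the filter output magnitude tracks the amplitude envelope at that frequency with fine temporal resolution.

```python
import numpy as np

def sff_envelope(x, f_k, fs, r=0.995):
    """Single-frequency filtering (SFF) sketch: shift the component at f_k
    up to pi rad/sample, then apply a single-pole filter with its pole at
    z = -r (near the unit circle). |y[n]| approximates the amplitude
    envelope of x at frequency f_k."""
    n = np.arange(len(x))
    omega = np.pi - 2.0 * np.pi * f_k / fs   # shift f_k to the pole frequency
    x_shift = x * np.exp(1j * omega * n)
    y = np.zeros(len(x), dtype=complex)
    y[0] = x_shift[0]
    for i in range(1, len(x)):
        y[i] = -r * y[i - 1] + x_shift[i]    # single-pole recursion
    return np.abs(y)

# A 200 Hz tone yields a much larger envelope when filtered at 200 Hz
# than at an unrelated frequency such as 1000 Hz.
fs = 8000
t = np.arange(0, 0.1, 1.0 / fs)
x = np.sin(2 * np.pi * 200 * t)
on = sff_envelope(x, 200, fs).mean()
off = sff_envelope(x, 1000, fs).mean()
```

Because the pole radius r is close to 1, the filter is extremely narrowband, which is what gives SFF its high spectral resolution at a chosen frequency.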
“…The study in [17] suggests a set of deep neural models to classify the best-known English dialects. The deep classifiers used are the time-delay neural network (TDNN), the convolutional neural network (CNN), the temporal convolutional neural network (TCN), and the TDNN with emphasized channel attention (ECAPA-TDNN).…”
Accents, or changes in how different people speak the same word or sentence in the same language, pose substantial communication issues in most spoken languages. This is a well-known fact, but how does the accent of one language affect learning and speaking another? In this paper, we look at how Arab accents influence the English language. To that end, we built a deep machine-learning system for Arabic accent recognition, trained on an in-house English speech database of four Arabic accents collected from Jordan, Iraq, Saudi Arabia, and Tunisia. The proposed system employs Mel spectrograms of a spoken English paragraph to train an LSTM neural network to recognize the accent in each sound signal. Although the collected data was extremely difficult to learn from, because each class contained both male and female speakers as well as fluent speakers, the proposed system could recognize speakers with various accents with up to 79% accuracy. This answers the study's main question, demonstrating that speakers with an Arabic accent have their own way of speaking English, which varies by country. As a result, if trained on appropriate and adequate data, the proposed system can also be used to recognize accents in any language.
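The pipeline described above feeds Mel spectrograms into an LSTM. The sketch below shows the feature-extraction half in plain numpy — a minimal version of a Mel spectrogram, not the authors' code; the FFT size, hop length, and filter count are illustrative assumptions in the typical range for speech.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, fs):
    """Triangular filters equally spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

def mel_spectrogram(x, fs, n_fft=512, hop=160, n_mels=40):
    """Frame the signal, take the power spectrum, and pool it with the
    mel filterbank; the log compresses the dynamic range."""
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    return np.log(mel_filterbank(n_mels, n_fft, fs) @ spec.T + 1e-10)

# One second of 16 kHz audio -> a (n_mels, n_frames) log-mel matrix,
# the kind of 2-D input a recurrent or convolutional classifier consumes.
fs = 16000
x = np.random.default_rng(0).standard_normal(fs)
M = mel_spectrogram(x, fs)
```

Each column of M is one time step, so an LSTM would process the matrix column by column.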
“…Many feature sets have been proposed with statistical and deep learning-based classifiers. A few widely used feature sets are as follows: Mel frequency cepstrum coefficients (MFCCs); inverse MFCCs (IMFCCs) [15]; linear frequency cepstrum coefficients (LFCCs); constant Q cepstrum coefficients (CQCCs) [16]; log-power spectrum using discrete Fourier transform (DFT) [17]; Gammatonegram; group delay over the frame, referred to as GD-gram [18]; modified group delay; All-Pole Group Delay [19]; Cochlear Filter Cepstral Coefficient—Instantaneous Frequency [20]; cepstrum coefficients using single-frequency filtering [21, 22]; Zero-Time Windowing (ZTW) [23]; Mel-frequency cepstrum using ZTW [24]; and polyphase IIR filters [25]. The human ear uses Fourier transform magnitude and neglects the phase information [26].…”
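Most of the cepstral features in the list above (MFCC, IMFCC, LFCC, CQCC) share one final step: a DCT-II of the log filterbank energies, which decorrelates them; only the filterbank in front differs. A minimal sketch of that shared step, with an explicit DCT basis rather than a library call (the coefficient count is an illustrative assumption):

```python
import numpy as np

def cepstral_coeffs(log_energies, n_ceps=13):
    """DCT-II of log filterbank energies -- the decorrelation step shared
    by MFCC-family features. Returns the first n_ceps coefficients."""
    n = len(log_energies)
    k = np.arange(n)
    # Row m of the basis is cos(pi * m * (k + 0.5) / n), the DCT-II kernel.
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n)
    return basis @ log_energies

# A flat (constant) log spectrum has all its energy in coefficient 0:
c = cepstral_coeffs(np.ones(20))
```

Coefficient 0 carries overall log energy; higher coefficients capture progressively finer spectral-envelope detail.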
Voice-controlled devices are in demand due to their hands-free controls. However, using voice-controlled devices in sensitive scenarios like smartphone applications and financial transactions requires protection against fraudulent attacks referred to as “speech spoofing”. The algorithms used in spoof attacks are practically unknown; hence, further analysis and development of spoof-detection models are required to improve spoof classification. A study of the spoofed-speech spectrum suggests that high-frequency features discriminate well between genuine and spoofed speech. Typically, linear or triangular filter banks are used to obtain high-frequency features. However, a Gaussian filter can extract more global information than a triangular filter. In addition, MFCC features are preferable to other speech features because of their lower covariance. Therefore, in this study, the use of a Gaussian filter is proposed for the extraction of inverted MFCC (iMFCC) features, providing high-frequency features. Complementary features are integrated with iMFCC to strengthen the features that aid in the discrimination of spoofed speech. Deep learning has been proven to be efficient in classification applications, but the selection of its hyper-parameters and architecture is crucial and directly affects performance. Therefore, a Bayesian algorithm is used to optimize the BiLSTM network. Thus, in this study, we build a high-frequency-based optimized BiLSTM network to classify the spoofed-speech signal, and we present an extensive investigation using the ASVSpoof 2017 dataset. The optimized BiLSTM model is successfully trained in the fewest epochs and achieved a 99.58% validation accuracy. The proposed algorithm achieved a 6.58% EER on the evaluation dataset, a relative improvement of 78% over a baseline spoof-identification system.
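The abstract's claim that a Gaussian filter captures "more global information" than a triangular one can be made concrete: for filters centered at the same bin with comparable main-lobe width, the Gaussian's tails extend well beyond the triangle's compact support, so it aggregates energy from a wider spectral neighbourhood. A small numpy comparison (bin counts, center, and width are illustrative assumptions, not the paper's parameters):

```python
import numpy as np

def triangular(center, half_width, n_bins):
    """Triangular filter: linear ramp up/down, zero outside +/- half_width."""
    k = np.arange(n_bins)
    return np.clip(1.0 - np.abs(k - center) / half_width, 0.0, None)

def gaussian(center, half_width, n_bins):
    """Gaussian filter with sigma chosen so its width roughly matches
    the triangle's main lobe; its tails never reach exactly zero."""
    k = np.arange(n_bins)
    sigma = half_width / 2.0
    return np.exp(-0.5 * ((k - center) / sigma) ** 2)

tri = triangular(64, 16, 128)
gau = gaussian(64, 16, 128)

# Count bins where each filter has non-negligible weight: the Gaussian's
# effective support is clearly wider, i.e. it pools more global context.
support_tri = np.count_nonzero(tri > 1e-3)
support_gau = np.count_nonzero(gau > 1e-3)
```

Swapping the window shape in a filterbank-based front end (such as the iMFCC extractor above) changes how much neighbouring spectral energy leaks into each coefficient.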
“…[1] used TDNN to predict the active power demand on a P4 bus in Presidente Prudente. Experimental results demonstrated its validity [11]. [21] used TDNN as a facial expression classifier for an intelligent robot, establishing command laws by analyzing and recognizing facial expressions and translating them into a robot-recognizable language.…”
“…The paper [11] investigated the performance of deep neural networks, convolutional neural networks, temporal convolutional neural networks, and TDNN for English dialect classification. The results showed that TDNN and ECAPA-TDNN classifiers capture a wider temporal context, further improving the performance of the classification models.…”
With the development of information technology, online vocal teaching is becoming more popular, and as it does, the need for high-quality sound in these digital environments becomes more critical. This research tackles the problem of improving sound quality in real-time vocal teaching by integrating advanced technologies such as Blockchain and Machine Learning within the Internet of Things (IoT) security framework. We created a vocal recognition model using a Time-Delay Neural Network (TDNN) and improved it with a Generated Feature Vector (GFV). This integration yields a strong GTDNN vocal recognition system that is specifically designed to secure and optimize web-based vocal teaching. Our experiments show that GTDNN outperforms traditional TDNN and i-vector methods in feature vector extraction, adapting well to different speech environments. Across various speech settings, GTDNN's Equal Error Rates (EERs) are low at 11.3%, 12.0%, 4.9%, 6.2%, and 6.1%, indicating superior performance over comparison models. GTDNN has an EER of 9.6% for short-duration speech and 2.3% for long-duration speech. Furthermore, the GTDNN system achieves an overall pass rate of 94% for target speech and a high rejection rate for non-target speech, ensuring accuracy in a variety of speech environments.
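Several of the works cited on this page build on TDNNs. A TDNN layer is, in essence, a 1-D convolution over time in which each output frame depends on input frames at a fixed set of temporal offsets. The sketch below is a generic illustration of that idea in numpy, not any cited paper's architecture; the context offsets, dimensions, and ReLU choice are illustrative assumptions.

```python
import numpy as np

def tdnn_layer(frames, weights, bias, context=(-2, 0, 2)):
    """One TDNN layer: each output frame combines input frames at the
    given temporal offsets (a dilated 1-D convolution over time).
    frames:  (T, d_in)
    weights: (len(context), d_in, d_out), one matrix per offset
    bias:    (d_out,)"""
    T, _ = frames.shape
    lo, hi = -min(context), max(context)   # frames lost at each boundary
    out = []
    for t in range(lo, T - hi):
        acc = bias.copy()
        for w, c in zip(weights, context):
            acc = acc + frames[t + c] @ w  # gather the offset frame
        out.append(np.maximum(acc, 0.0))   # ReLU nonlinearity
    return np.array(out)

# 10 input frames of dimension 3 -> 6 output frames of dimension 4,
# since offsets {-2, 0, +2} trim two frames at each end.
rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 3))
W = rng.standard_normal((3, 3, 4))
b = rng.standard_normal(4)
out = tdnn_layer(frames, W, b)
```

Stacking such layers widens the effective temporal receptive field, which is why the snippet at [11] above attributes the TDNN and ECAPA-TDNN gains to their wider temporal context.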
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.