Comparison of feature extraction methods for speech recognition in noise-free and in traffic noise environment

Sárosi, Gellért; Mozsary, Mihaly; Mihajlik, Péter; Fegyó, Tibor

doi:10.1109/sped.2011.5940729

Cited by 16 publications

(7 citation statements)

References 5 publications

“…However, when this system was evaluated with the handset TIMIT (HTIMIT) Corpus, which is a database of speech data collected over different telephone channels, the accuracy was degraded to 34.4%, owing to the distortions that are present in communication channels. In research [55], two different noise signals: white noise and street noise were considered for the task of word recognition of six languages: English, German, French, Italian, Spanish and Hungarian. The results obtained showed that both PLP and MFCC achieved approximately the same accuracies.…”

Section: Automatic Speech Recognition Systems (mentioning)

confidence: 99%

Comparative study of automatic speech recognition techniques

Cutajar, Gatt, Grech, et al. 2013. IET Signal Process.

Over the past decades, extensive research has been carried out on various possible implementations of automatic speech recognition (ASR) systems. The most renowned algorithms in the field of ASR are the mel-frequency cepstral coefficients and the hidden Markov models. However, there are also other methods, such as wavelet-based transforms, artificial neural networks and support vector machines, which are becoming more popular. This review article presents a comparative study on different approaches that were proposed for the task of ASR, and which are widely used nowadays.

† training time increases linearly with increase in vocabulary size [42]
† quantisation error in the discrete representation of speech signals [42]
† temporal information is ignored [42]

PCA
† reduction in the feature vector's size, while retaining much of the significant information [131]
† robust [59, 60]
† computationally expensive for high-dimensional data [8]

LDA
† maximises the distance between classes, but minimises the within class distance [132]
† robust [133]
† sample distribution is assumed a priori to be Gaussian [63]
† class samples are assumed to have equal variance [63]

Classification technique | Advantages | Disadvantages

HMM
† able to model time distribution of speech signals [103]
† simple to adapt [68]
† capable to model a sequence of discrete or continuous symbols [13]
† inputs can be of variable length [40]
† based on the assumption that the probability of being in a particular state is dependent only on its preceding state, ignoring any long-term dependencies [82]
† emission probabilities are arbitrarily chosen; hence, these might not even represent properly the output probabilities of the corresponding state [82]

ANN (in general)
† good classifiers [16, 45]
† highly adequate for pattern recognition applications [16, 45]
† self-organising [16, 45]
† self-learning [16, 45]
† self-adaptive in new environments [16, 45]
† robust [7]
† based on ERM; hence, prone to over training and local minima problems [45, 103]

MLP
† good discriminating ability [2]
† unable to model time distribution of speech signals [2]
† inputs have to be of fixed length [2]
† able to deal with small vocabularies only [2]

SOM
† no a priori information is required for training a SOM [134]
† can easily adapt if a new sample is presented to it [134]
† capable of parallel computation [134]
† SOM algorithm is not well defined mathematically; hence, values for the network parameters need to be found by trial-and-error [134]
† ordered mapping obtained after the training phase may be lost when applied in real environments due to frequent adaptations [134]

RBF
† simple to implement [135]
† good discriminating ability [135]
† robust [135]
† online learning ability [135]
† shift invariant in time [91]

RNN
† able to model time distribution of speech signals thanks to the feedback connections [95, 103]
† complex training algorithm [94]
† training algorithm is highly sensitive to any changes [94]

FNN
† does not need large amount of samples during the learning process [99]
† ...

“…Feature extraction is achieved by transforming the speech waveform to a parametric representation for subsequent processing and analysis at a lower data rate. Once quality features are extracted, classification is easy [9].…”

Section: Introduction (mentioning)

confidence: 99%

ERIL: An Algorithm for Emotion Recognition From Indian Languages Using Machine Learning

Mehra, Jain 2021. Preprint

For a human interaction with machine, it is important that it understand the mood of the speaker. Until now we train machines on neutral speeches or utterances. The mood of a person would affect their performances. Deciphering human mood is challenging for the machines, as human can create fourteen distinct sound in a second. For a machine to understand the human behaviour, it should understand the acoustic abilities of the human ear. Mel Frequency Cepstral Coefficients (MFCC) and Linear Prediction coefficients (LPC) can replicate human auditory system. The proposed model Emotion Recognition from Indian Languages (ERIL) extracts emotions like fear, anger, surprise, sadness, happiness, and neutral. ERIL first pre-processes the voice signal, extracts selective MFCC, LPC, pitch, and voice quality features, then classifies the speech using Catboost. ERIL is a multilingual emotion classifier, it is independent of any language. We checked it on Hindi, Gujarati, Marathi, Punjabi, Bangla, Tamil, Oriya, and Telugu. We recorded a speech dataset of various emotions in these languages. ERIL is compared to other benchmark classifiers.

“…Another challenge is sufficient advancement of children ASR system where intelligent speech innovations: YouTube Kids, Amazon Alexa, and computeraided language learning has been currently crucial in the process of classroom learning (Valente et al 2012). Since, the acoustic and linguistic patterns in case of children speech signals are very unique which indulge speaking rate, vocal tract length when contrasted to an adult speech signal (Subramanian et al 2019). Additionally, the accessibility of limited children speech datasets even in the context of native language prompts obstruction in development of efficient children speech recognition systems.…”

Section: Introduction (mentioning)

confidence: 99%

Spectral-Warping Based Noise-Robust Enhanced Children ASR System

Bawa, Kadyan, Kumar, et al. 2021. Preprint

In real-life applications, noise originating from different sound sources modifies the characteristics of an input signal which affects the development of an enhanced ASR system. This contamination degrades the quality and comprehension of speech variables while impacting the performance of human-machine communication systems. This paper aims to minimise noise challenges by using a robust feature extraction methodology through introduction of an optimised filtering technique. Initially, the evaluations for enhancing input signals are constructed by using state transformation matrix and minimising a mean square error based upon the linear time variance techniques of Kalman and Adaptive Wiener Filtering. Consequently, Mel-frequency cepstral coefficients (MFCC), Linear Predictive Cepstral Coefficient (LPCC), RelAtive SpecTrAl-Perceptual Linear Prediction (RASTA-PLP) and Gammatone Frequency cepstral coefficient (GFCC) based feature extraction methods have been synthesised with their comparable efficiency in order to derive the adequate characteristics of a signal. It also handle the large-scale training complexities lies among the training and testing dataset. Consequently, the acoustic mismatch and linguistic complexity of large-scale variations lies within small set of speakers have been handle by utilising the Vocal Tract Length Normalization (VTLN) based warping of the test utterances. Furthermore, the spectral warping approach has been used by time reversing the samples inside a frame and passing them into the filter network corresponding to each frame. Finally, the overall Relative Improvement (RI) of 16.13% on 5-way perturbed spectral warped based noise augmented dataset through Wiener Filtering in comparison to other systems respectively.
