MFCC (Mel Frequency Cepstral Coefficients) and PLP (Perceptual linear prediction coefficients)or RASTA-PLP have demonstrated good results whether when they are used in combination with prosodic features as suprasegmental (long-term) information or when used stand-alone as segmental (short-time) information. MFCC and PLP feature parameterization aims to represent the speech parameters in a way similar to how sound is perceived by humans. However, MFCC and PLP are usually computed from a Hamming-windowed periodogram spectrum estimate that is characterized by large variance. In this paper we study the effect of averaging spectral estimates obtained using a set of orthogonal tapers (windows) on emotion recognition performance. The multitaper MFCC and PLP are examined separately as short-time information vectors modeled using Gaussian mixture models (GMMs). When tested on the FAU AIBO spontaneous emotion corpus, a relative improvement ranging from 2.2% to 3.9% for both MFCC and PLP systems is achieved by multiple windowed spectral features compared to single windowed ones.
Several methods have recently been proposed to analyze speech and automatically infer the personality of the speaker. These methods often rely on prosodic and other hand crafted speech processing features extracted with off-the-shelf toolboxes. To achieve high accuracy, numerous features are typically extracted using complex and highly parameterized algorithms. In this paper, a new method based on feature learning and spectrogram analysis is proposed to simplify the feature extraction process while maintaining a high level of accuracy. The proposed method learns a dictionary of discriminant features from patches extracted in the spectrogram representations of training speech segments. Each speech segment is then encoded using the dictionary, and the resulting feature set is used to perform classification of personality traits. Experiments indicate that the proposed method achieves state-of-the-art results with a significant reduction in complexity when compared to the most recent reference methods. The number of features, and difficulties linked to the feature extraction process are greatly reduced as only one type of descriptors is used, for which the 6 parameters can be tuned automatically. In contrast, the simplest reference method uses 4 types of descriptors to which 6 functionals are applied, resulting in over 20 parameters to be tuned.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.