de Wet, JASA

This study investigates ways of finding a low-dimensional, formant-related physical representation of speech signals that is suitable for automatic speech recognition. This aim is motivated by the fact that formants are known to be discriminant features for speech recognition. Combinations of automatically extracted formant-like features and state-of-the-art, noise-robust features have previously been shown to be more robust in adverse conditions than state-of-the-art features alone. However, it is not clear how these automatically extracted formant-like features behave in comparison with true formants. The purpose of this paper is to investigate two methods for automatically extracting formant-like features, i.e. robust formants and HMM2 features, and to compare these features to hand-labeled formants as well as to mel-frequency cepstral coefficients in terms of their performance on a vowel classification task. The speech data and hand-labeled formants used in this study are a subset of the American English vowels database presented in [Hillenbrand et al., J. Acoust. Soc. Am. 97, 3099-3111 (1995)]. Classification performance was measured on the original, clean data as well as in (simulated) adverse conditions. In combination with standard automatic speech recognition methods, the classification performance of the robust formant and HMM2 features compares very well to the performance of the hand-labeled formants.

PACS numbers: 43.72.Ne, 43.72.Ar
I Introduction

Human speech signals can be described in many different ways (Flanagan, 1972; Rabiner and Schafer, 1978). Some descriptions are directly related to speech production, while others are more suitable for investigating speech perception. Some descriptive frameworks, of which the formant representation is a well-known example, have successfully been applied to both production and perception.

Speech production is often modeled as an acoustic source feeding into a linear filter (representing the vocal tract) with little or no interaction between the source and the filter. In terms of this model of acoustic speech production, the phonetically relevant properties of speech signals can be characterized by the resonance frequencies of the filter (supplemented with information on the source, in terms of periodicity and power). It is well known that the frequencies of the first two or three formants provide sufficient information for the perceptual identification of vowels (Flanagan, 1972; Minifie et al., 1973). The formant representation is attractive because of its parsimonious character: it allows speech signals to be represented with a very small number of parameters. Not surprisingly, many attempts have been made to exploit the parametric formant representation in speech technology applications such as speech synthesis, speech coding and automatic speech recognition (ASR).

A special reason why formants make for an attractive representation of the acoustic characteristics of speech signals is their relation -by virt...