The first three formants, i.e., the first three spectral prominences of the short-time magnitude spectra, have been the most commonly used acoustic cues for vowels ever since the work of Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)]. However, spectral-shape features, which encode the global smoothed spectrum, provide a more complete spectral description and therefore might be even better acoustic correlates for vowels. In this study, automatic vowel classification experiments were used to compare formants and spectral-shape features for monophthongal vowels spoken in the context of isolated CVC words, under a variety of conditions. The roles of static and time-varying information for vowel discrimination were also compared. Spectral shape was encoded using the coefficients in a cosine expansion of the nonlinearly scaled magnitude spectrum. Under almost all conditions investigated, in the absence of fundamental frequency (F0) information, automatic vowel classification based on spectral-shape features was superior to that based on formants. If F0 was used as an additional feature, vowel classification based on spectral-shape features was still superior to that based on formants, but the differences between the two feature sets were reduced. It was also found that the pattern of perceptual confusions was more closely correlated with errors in automatic classification obtained from spectral-shape features than with classification errors from formants. Therefore, it is concluded that spectral-shape features are a more complete set of acoustic correlates for vowel identity than are formants. In comparing static and time-varying features, static features were the most important for vowel discrimination, but feature trajectories were valuable secondary sources of information.
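As one concrete illustration of this encoding, the sketch below computes cosine-expansion coefficients of a nonlinearly scaled log-magnitude spectrum for a single windowed speech frame. The bilinear warping constant, the number of coefficients, and all other parameter values are assumptions chosen for illustration, not the study's exact settings.

    import numpy as np

    def dctc_features(frame, n_coeffs=10, alpha=0.45):
        """Cosine-expansion coefficients of the bilinearly warped
        log-magnitude spectrum of one windowed speech frame."""
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
        log_mag = np.log(spectrum + 1e-10)          # log magnitude
        # Bilinear frequency warping of the normalized axis (|alpha| < 1);
        # alpha = 0.45 is an illustrative choice, not the study's setting.
        omega = np.pi * np.linspace(0.0, 1.0, len(log_mag))
        warped = omega + 2.0 * np.arctan(
            alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))
        # Resample the log spectrum uniformly along the warped axis.
        uniform = np.linspace(0.0, np.pi, len(log_mag))
        warped_spec = np.interp(uniform, warped, log_mag)
        # Cosine basis expansion (DCT-II) of the warped log spectrum.
        n = np.arange(len(warped_spec))
        return np.array([np.sum(warped_spec *
                                np.cos(np.pi * k * (n + 0.5) / len(n)))
                         for k in range(n_coeffs)])

The zeroth coefficient tracks overall spectral level; the remaining coefficients summarize the smoothed spectral shape.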
Two methods are described for speaker normalization of vowel spectral features: one is a multivariable linear transformation of the features and the other is a polynomial warping of the frequency scale. Both normalization algorithms minimize the mean-square error between the transformed data of each speaker and vowel target values obtained from a "typical speaker." These normalization techniques were evaluated both for formants and for a form of cepstral coefficients (DCTCs) as spectral parameters, for both static and dynamic features, and with and without fundamental frequency (F0) as an additional feature. The normalizations were tested with a series of automatic classification experiments for vowels. For all conditions, automatic vowel classification rates were higher for speaker-normalized data than for nonnormalized parameters. Typical classification rates for vowel test data with nonnormalized and normalized features, respectively, are as follows: static formants--69%/79%; formant trajectories--76%/84%; static DCTCs--75%/84%; DCTC trajectories--84%/91%. The linear transformation method increased the classification rates slightly more than the polynomial frequency warping. The addition of F0 improved the automatic recognition results for nonnormalized vowel spectral features by as much as 5.8%. However, the addition of F0 to speaker-normalized spectral features resulted in much smaller increases in automatic recognition rates.
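A minimal sketch of the polynomial-frequency-warping idea follows, assuming the warp is applied to formant frequencies and fit per speaker by ordinary least squares against per-vowel targets from a "typical speaker." The second-order warp and all names are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    def fit_polynomial_warp(speaker_formants, target_formants, order=2):
        """Fit a per-speaker frequency warp f -> a1*f + a2*f**2 + ... that
        minimizes mean-square error to the typical-speaker targets.
        Rows of the two (n_tokens, n_formants) arrays are paired by vowel."""
        f = speaker_formants.ravel()
        t = target_formants.ravel()
        # Design matrix with columns f, f**2, ..., f**order (no constant
        # term, so that 0 Hz maps to 0 Hz).
        X = np.column_stack([f ** k for k in range(1, order + 1)])
        coeffs, *_ = np.linalg.lstsq(X, t, rcond=None)
        return coeffs

    def apply_warp(formants, coeffs):
        """Apply the fitted polynomial warp to any formant array."""
        return sum(c * formants ** (k + 1) for k, c in enumerate(coeffs))

The multivariable linear transformation alternative replaces the powers-of-frequency design matrix with the feature vectors themselves, solved by the same least-squares step.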
An algorithm for matching physical and perceptual spaces for psychological stimuli will be described. Target points for each stimulus class must be chosen in a multidimensional perceptual space. The physical space consists of a multidimensional measurement space, in which measurements are made of each stimulus for a large number of subjects. A linear transformation from the measurement space to the perceptual space is determined such that the mean-square distance between target points and transformed measurement points is minimized. There is no requirement that the dimensionality of the measurement and perceptual spaces be the same. Thus the algorithm can be used to redefine the measurement space with fewer dimensions such that the correspondence with predefined stimulus categories is maximized. This procedure has been tested using vowels spoken in an /hVd/ context, six principal components as measurement parameters, and a three-dimensional perceptual space. Target positions in the perceptual space were based on published data from multidimensional scaling experiments for vowels. The resultant transformation has been used to map vowels to colors for use in a speech training aid for the hearing impaired. Experimental results will be given. [Work supported by the Whitaker Foundation.]
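The core least-squares step admits a closed-form solution. The sketch below assumes paired arrays in which each measurement vector has a known target point in the perceptual space; the function name and the bias term are illustrative assumptions.

    import numpy as np

    def fit_linear_map(M, P):
        """M: (n_stimuli, d_meas) measurement vectors; P: (n_stimuli, d_perc)
        perceptual target points. Returns W, b minimizing the mean-square
        error ||M @ W + b - P||^2; d_meas and d_perc need not be equal."""
        M_aug = np.hstack([M, np.ones((len(M), 1))])   # append a bias column
        sol, *_ = np.linalg.lstsq(M_aug, P, rcond=None)
        return sol[:-1], sol[-1]                       # W, b

With d_meas = 6 (the principal components) and d_perc = 3, `M @ W + b` maps each token into the three-dimensional perceptual space, which is how the dimensionality reduction described above can be realized.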
Traditional theories of vowel perception favor formants over global spectral shape as the primary perceptual cues to vowel identity. At previous ASA meetings, results of speaker-independent automatic recognition experiments for vowels were reported that contrasted global spectral shape with formants [A. J. Jagharghi and S. A. Zahorian, J. Acoust. Soc. Am. Suppl. 1 81, S18 (1987); S. A. Zahorian and A. J. Jagharghi, J. Acoust. Soc. Am. Suppl. 1 82, S37 (1987)]. Those results indicate that automatic recognition rates based on global spectral shape are generally slightly superior to recognition rates based on formants. In the present study, perception is investigated for vowels synthesized such that the tokens contain conflicting spectral-shape and formant cues to vowel identity. Two distinct but acoustically close vowels are selected. The spectral shape of the first vowel is modified to match, to the extent possible, the spectral shape of the second vowel without any change in the frequencies of F1, F2, and F3. Thus the modified vowel has the same formants as the first vowel, but its spectral shape matches that of the second vowel. Listening experiments indicate that, for most conditions, the modified vowel segments are perceived according to spectral-shape cues rather than formant cues. The details of the experimental procedures and the results of the listening experiments will be presented at the meeting. [Work supported by NSF.]
Automatic recognition experiments were performed to compare overall spectral shape versus formants as speaker-independent acoustic parameters for vowel identity. Stimuli consisted of four repetitions of 11 vowels spoken by 17 female and 12 male speakers (29*11*4 = 1276 total stimuli). Formants were computed automatically by peak picking of 12th-order LP model spectra. Spectral shape was represented using three methods: (1) a cosine basis vector expansion of the power spectrum; (2) the output of a 16-channel, 1/3-oct filter bank; and (3) the output of a 16-channel mel-spaced filter bank. Automatic recognition was based on maximum likelihood estimation in a multidimensional space. For all cases considered, the representations based on spectral shape resulted in significantly higher recognition accuracy than recognition based on only three formants. For example, using the entire database of all speakers and 11 vowels, recognition based on spectral shape was about 85% versus 69% for three formants. If the data were restricted to female speakers and the seven vowels /a,i,u,æ,ɝ,ɪ,ɛ/, recognition was about 97% based on spectral shape versus 84% for formants. These results indicate that, at least for automatic recognition of vowels, spectral peak detection is neither necessary nor sufficient. [Work supported by NSF.]
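The abstract does not spell out the likelihood model; the sketch below assumes one multivariate Gaussian per vowel class with equal class priors, a common reading of maximum likelihood classification in a multidimensional feature space. Treat it as one plausible realization rather than the study's exact classifier.

    import numpy as np

    class GaussianMLClassifier:
        """Maximum-likelihood classifier: one full-covariance Gaussian per
        class, equal priors assumed (an illustrative assumption)."""

        def fit(self, X, y):
            self.classes_ = np.unique(y)
            self.params_ = {}
            for c in self.classes_:
                Xc = X[y == c]
                mu = Xc.mean(axis=0)
                cov = np.cov(Xc, rowvar=False)
                self.params_[c] = (mu, np.linalg.inv(cov),
                                   np.linalg.slogdet(cov)[1])
            return self

        def predict(self, X):
            scores = []
            for c in self.classes_:
                mu, inv_cov, logdet = self.params_[c]
                d = X - mu
                # Log-likelihood up to a constant:
                # -0.5 * (log|Sigma| + squared Mahalanobis distance).
                maha = np.einsum('ij,jk,ik->i', d, inv_cov, d)
                scores.append(-0.5 * (logdet + maha))
            return self.classes_[np.argmax(np.vstack(scores), axis=0)]

Any of the three spectral-shape representations, or the three formant frequencies, can serve as the feature matrix X.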
It has generally been assumed, at least since the time of the comprehensive study by Peterson and Barney [J. Acoust. Soc. Am. 24, 175–184 (1952)], that the formant locations in vowel spectra are the most significant cues to vowel identity. In this experiment, vowel spectra were represented by two methods: (A) by the locations of the first three formants, and (B) by the overall smoothed spectral shape in terms of a discrete cosine transform of the power spectra. Stimuli consisted of four repetitions of the widely separated vowels /u/, /i/, /a/, spoken by each of 12 female and 12 male speakers (4⋅24⋅3 = 288 stimuli total). For each of the two spectral encoding methods, A and B, the vowel data were projected to a three-dimensional space such that the vowel categories would be well separated and the vowels within each category well clustered [S. A. Zahorian and A. J. Jagharghi, J. Acoust. Soc. Am. Suppl. 1 79, S8 (1986)]. Significantly better clustering was obtained with method B, based on overall spectral shape, than with method A, based only on the first three formant frequencies. Since these results are not based on perceptual experiments, no direct conclusion can be drawn regarding the perceptual importance of spectral peaks versus overall spectral shape for human perception of vowels. However, the results do indicate that automatic machine identification of vowels can be improved by parameterizing the overall spectral shape rather than only the spectral peaks.
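One plausible realization of such a projection is the Fisher (LDA) criterion, which maximizes between-category scatter relative to within-category scatter; the cited abstract's exact criterion may differ, so the sketch below is an illustrative stand-in.

    import numpy as np
    from scipy.linalg import eigh

    def discriminant_projection(X, y, n_dims=3):
        """X: (n_tokens, n_features) spectral features; y: vowel labels.
        Returns an (n_features, n_dims) projection that separates the
        categories while clustering tokens within each category."""
        overall_mean = X.mean(axis=0)
        Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
        Sb = np.zeros_like(Sw)                    # between-class scatter
        for c in np.unique(y):
            Xc = X[y == c]
            mc = Xc.mean(axis=0)
            Sw += (Xc - mc).T @ (Xc - mc)
            diff = (mc - overall_mean)[:, None]
            Sb += len(Xc) * (diff @ diff.T)
        # Generalized eigenproblem Sb v = lambda Sw v; keep the n_dims
        # directions with the largest eigenvalues.
        vals, vecs = eigh(Sb, Sw)
        return vecs[:, np.argsort(vals)[::-1][:n_dims]]

Applying `X @ discriminant_projection(X, y)` to the formant features (method A) or the cosine-transform features (method B) yields the three-dimensional clusterings being compared.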