A quantitative perceptual model of human vowel recognition based upon psychoacoustic and speech perception data is described. At an intermediate auditory stage of processing, the specific bark difference level of the model represents the pattern of peripheral auditory excitation as the distance in critical bands (barks) between neighboring formants and between the fundamental frequency (F0) and first formant (F1). At a higher, phonetic stage of processing, represented by the critical bark difference level of the model, the transformed vowels may be dichotomously classified based on whether the difference between formants in each dimension falls within or exceeds the critical distance of 3 bark for the spectral center of gravity effect [Chistovich et al., Hear. Res. 1, 185-195 (1979)]. Vowel transformations and classifications correspond well to several major phonetic dimensions and features by which vowels are perceived and traditionally classified. The F1-F0 dimension represents vowel height, and high vowels have F1-F0 differences within 3 bark. The F3-F2 dimension corresponds to vowel place of articulation, and front vowels have F3-F2 differences of less than 3 bark. As an inherent, speaker-independent normalization procedure, the model provides excellent vowel clustering while it greatly reduces between-speaker variability. It offers robust normalization through feature classification because gross binary categorization allows for considerable acoustic variability. There was generally less formant and bark difference variability for closely spaced formants than for widely spaced formants. These findings agree with independently observed perceptual results and support Stevens' quantal theory of vowel production and perceptual constraints on production predicted from the critical bark difference level of the model.
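The two levels of the model can be illustrated with a minimal sketch. The abstract does not say which Hz-to-bark conversion is used, so the Zwicker and Terhardt (1980) approximation is assumed here. The function names, the strict `<` comparison at the 3-bark boundary, and the example formant values (rough textbook-style figures for an adult male /i/ and /ɑ/) are illustrative assumptions and are not taken from the source.

```python
import math

# Critical distance for the spectral center of gravity effect (from the abstract).
CRITICAL_DISTANCE_BARK = 3.0


def hz_to_bark(f_hz):
    """Convert frequency in Hz to bark.

    Assumption: Zwicker & Terhardt (1980) approximation; the abstract
    does not specify the conversion formula.
    """
    khz = f_hz / 1000.0
    return 13.0 * math.atan(0.76 * khz) + 3.5 * math.atan((khz / 7.5) ** 2)


def bark_differences(f0, f1, f2, f3):
    """Bark difference level: distances between F0/F1 and neighboring formants."""
    z0, z1, z2, z3 = (hz_to_bark(f) for f in (f0, f1, f2, f3))
    return {"F1-F0": z1 - z0, "F2-F1": z2 - z1, "F3-F2": z3 - z2}


def classify(f0, f1, f2, f3):
    """Critical bark difference level: binary features from the 3-bark criterion.

    Only the two dimensions named in the abstract are classified here:
    F1-F0 (height) and F3-F2 (place of articulation).
    """
    d = bark_differences(f0, f1, f2, f3)
    return {
        "high": d["F1-F0"] < CRITICAL_DISTANCE_BARK,
        "front": d["F3-F2"] < CRITICAL_DISTANCE_BARK,
    }


if __name__ == "__main__":
    # Hypothetical example values in Hz (F0, F1, F2, F3), for illustration only.
    print("/i/:", classify(136, 270, 2290, 3010))
    print("/ɑ/:", classify(124, 730, 1090, 2440))
```

With these illustrative inputs, the /i/-like token falls within the critical distance on both dimensions (high, front), while the /ɑ/-like token exceeds it on both (non-high, non-front). Because the classification is binary, moderate changes in the formant values leave the result unchanged, which is the sense in which the abstract describes gross categorization as tolerant of acoustic variability.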
The effects of speaking rate on spectral and temporal characteristics of American English vowels were studied. Five women and five men, native American English speakers without strong regional dialects, produced the carrier phrase sentence “I said hVd again” at slow, conversational, and fast rates. Twelve vowels, /i, ɪ, ey, ɛ, æ, ɜ, ʌ, ɑ, ɔ, ow, ʊ, u/, were studied in /hVd/ context. Acoustic analyses were performed using the SPIRE speech analysis programs on a LISP machine. Recordings were digitized at a 16-kHz sampling rate and low-pass filtered. Spectral measurements included fundamental frequency and the first four formants. Temporal measurements were vowel duration, syllable duration, and closure duration. Durations, duration differences, and rate-related changes were quite systematic within and across vowel types. Results indicated that the durations of long vowels and diphthongs were compressed more than those of short vowels as speaking rate increased. Spectral differences were relatively small across speaking rates. The utility of duration as well as spectral information for vowel classification will be discussed. [Work supported by NIH and UTD research grants, and performed in part at the Research Lab. of Electronics, MIT.]