To account for the phenomenon of virtual pitch, various theories assume, implicitly or explicitly, that each spectral component introduces a series of subharmonics. The spectral-compression method for pitch determination can be viewed as a direct implementation of this principle. The widespread application of this principle in pitch determination is, however, impeded by numerical problems with respect to accuracy and computational efficiency. A modified algorithm is described that solves these problems. Its performance is tested for normal speech and "telephone" speech, i.e., speech high-pass filtered at 300 Hz. The algorithm outperforms the harmonic-sieve method for pitch determination, while its computational requirements are about the same. The algorithm is described in terms of nonlinear system theory, i.e., subharmonic summation. It is argued that the favorable performance of the subharmonic-summation algorithm stems from its corresponding more closely with current pitch-perception theories than does the harmonic sieve.
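The principle of subharmonic summation can be illustrated with a minimal sketch. It is not the algorithm of the paper: the function names, the Gaussian smoothing of spectral components on a logarithmic frequency axis, the candidate grid, and the decay factor of 0.84 per harmonic are all assumptions made here for illustration. Each candidate pitch collects the spectral amplitude found at its integer multiples, with higher harmonics weighted less, and the candidate with the largest sum is taken as the pitch estimate.

```python
import math

def spectrum_at(freq_hz, components, rel_bandwidth=0.03):
    """Amplitude of a smoothed line spectrum at freq_hz.

    components: list of (frequency_hz, amplitude) pairs.
    Each component is smeared into a Gaussian bump on a log-frequency
    axis; rel_bandwidth is an assumed smoothing width.
    """
    total = 0.0
    for comp_hz, amp in components:
        d = math.log(freq_hz / comp_hz) / rel_bandwidth
        total += amp * math.exp(-0.5 * d * d)
    return total

def subharmonic_sum(f0_hz, components, n_harmonics=15, decay=0.84):
    """Weighted sum of the spectrum at the first n_harmonics multiples of f0_hz.

    decay is an assumed weighting factor: harmonic n contributes with
    weight decay ** (n - 1).
    """
    return sum(
        decay ** (n - 1) * spectrum_at(n * f0_hz, components)
        for n in range(1, n_harmonics + 1)
    )

def estimate_pitch(components, f_min=50.0, f_max=500.0, steps_per_octave=48):
    """Return the candidate on a logarithmic grid with the largest sum."""
    n_steps = int(math.log2(f_max / f_min) * steps_per_octave)
    candidates = [f_min * 2 ** (i / steps_per_octave) for i in range(n_steps + 1)]
    return max(candidates, key=lambda f: subharmonic_sum(f, components))

# Harmonics 3, 4 and 5 of 200 Hz; the fundamental itself is absent,
# as it would be in high-pass filtered ("telephone") speech.
components = [(600.0, 1.0), (800.0, 1.0), (1000.0, 1.0)]
pitch_hz = estimate_pitch(components)
print(round(pitch_hz, 1))
```

With these three components the largest sum falls at the missing fundamental near 200 Hz, which is the virtual-pitch behavior the abstract refers to.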
In intonation research, prominence-lending pitch movements have been described on either a linear or a logarithmic frequency scale. An experiment has been carried out to check whether pitch movements in speech intonation are perceived on one of these two scales or on a psychoacoustic scale representing the frequency selectivity of the auditory system. This last scale is intermediate between the other two. Subjects matched the excursion size of prominence-lending pitch movements in utterances resynthesized in different pitch registers. Their task was to adjust the excursion size in a comparison stimulus in such a way that it lent equal prominence to the corresponding syllable in a fixed test stimulus. The comparison stimulus and the test stimulus had pitches running parallel on either the logarithmic frequency scale, the psychoacoustic scale, or the linear frequency scale. In one half of the experimental sessions, the test stimulus was presented in the low register and the comparison stimulus in the high register; in the other half, the assignment was reversed. The result is that, in all cases, stimuli are matched in such a way that the average excursion sizes in different registers are equal on the psychoacoustic scale.
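The relation between the three scales can be made concrete with a short sketch. The abstract does not give a formula for the psychoacoustic scale; the code below assumes the ERB-rate scale in the commonly cited form E(f) = 21.4 log10(0.00437 f + 1), and the function names and example frequencies are hypothetical. It takes a pitch movement in a low register and computes the peak that yields an equal excursion in a higher register on each scale.

```python
import math

def hz_to_erb_rate(f_hz):
    """ERB-rate value for a frequency in Hz (assumed psychoacoustic scale)."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)

def erb_rate_to_hz(e):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def matched_peak_linear(low_base, low_peak, high_base):
    """Peak giving the same excursion in Hz."""
    return high_base + (low_peak - low_base)

def matched_peak_log(low_base, low_peak, high_base):
    """Peak giving the same excursion in semitones (same frequency ratio)."""
    return high_base * (low_peak / low_base)

def matched_peak_erb(low_base, low_peak, high_base):
    """Peak giving the same excursion on the ERB-rate scale."""
    excursion = hz_to_erb_rate(low_peak) - hz_to_erb_rate(low_base)
    return erb_rate_to_hz(hz_to_erb_rate(high_base) + excursion)

# Hypothetical example: a rise from 100 to 150 Hz, transposed to a 200 Hz baseline.
linear_peak = matched_peak_linear(100.0, 150.0, 200.0)
log_peak = matched_peak_log(100.0, 150.0, 200.0)
erb_peak = matched_peak_erb(100.0, 150.0, 200.0)
print(linear_peak, round(erb_peak, 1), log_peak)
```

Under these assumptions the ERB-matched peak lies between the linearly matched and the logarithmically matched peak, in line with the statement that the psychoacoustic scale is intermediate between the other two.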
It has been shown that visual display systems of intonation can be employed beneficially in teaching intonation to persons with deafness and in teaching the intonation of a foreign language. In current training situations, the correctness of a reproduced pitch contour is rated either by the teacher or automatically. In the latter case, an algorithm mostly estimates the maximum deviation from an example contour. In game-like exercises, for instance, the pupil has to produce a pitch contour within the displayed floor and ceiling of a "tunnel" with a preadjusted height. In an experiment described in the companion paper, phoneticians rated the dissimilarity of two pitch contours both auditorily, by listening to two resynthesized utterances, and visually, by looking at two pitch contours displayed on a computer screen. A test is reported in which these dissimilarity ratings were compared with automatic ratings obtained with this tunnel measure and with three other measures: the mean distance, the root-mean-square (RMS) distance, and the correlation coefficient. The tunnel measure, although the most frequently used, appeared to have the weakest correlation with the ratings by the phoneticians. In general, the automatic ratings obtained with the correlation coefficient showed the strongest correlation with the perceptual ratings. A disadvantage of this measure, however, may be that it normalizes for the range of the pitch contours. If range is important, as in intonation teaching to persons with deafness, the mean distance or the RMS distance is the best physical measure for automatic training of intonation.
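The four automatic measures can be sketched as follows. The abstract does not define them in detail, so the definitions here are assumptions: both contours are sampled at the same time points and expressed in the same unit, the tunnel measure is taken as the maximum absolute deviation, the mean distance as the mean absolute difference, and the correlation coefficient as the Pearson coefficient. All function names are hypothetical.

```python
import math

def _check(a, b):
    if len(a) != len(b) or not a:
        raise ValueError("contours must be non-empty and equally long")

def tunnel_distance(a, b):
    """Maximum absolute deviation between the two contours."""
    _check(a, b)
    return max(abs(x - y) for x, y in zip(a, b))

def mean_distance(a, b):
    """Mean absolute difference between the two contours."""
    _check(a, b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def rms_distance(a, b):
    """Root-mean-square difference between the two contours."""
    _check(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def correlation(a, b):
    """Pearson correlation coefficient; undefined for a flat contour."""
    _check(a, b)
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a flat contour")
    return cov / math.sqrt(var_a * var_b)

# Hypothetical contours: the pupil reproduces the shape with half the range.
example = [0.0, 2.0, 4.0, 2.0, 0.0]
attempt = [0.0, 1.0, 2.0, 1.0, 0.0]
print(tunnel_distance(example, attempt), mean_distance(example, attempt),
      rms_distance(example, attempt), correlation(example, attempt))
```

In this example the correlation coefficient rates the attempt as a perfect match although its range is halved, while the three distance measures do register the difference. This is the range normalization that the abstract names as a possible disadvantage.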
An algorithm is presented that correctly detects the large majority of vowel onsets in fluent speech. The algorithm is based on the simple assumption that vowel onsets are characterized by the appearance of rapidly increasing resonance peaks in the amplitude spectrum. Application to carefully articulated, isolated words results in a high number of false alarms, predominantly before consonants that can function as vowels in a different context, such as in another language or as a syllabic consonant. After some parameter settings are modified, this number of false alarms for isolated words can be reduced significantly, without the risk of a large number of missed detections. The temporal accuracy of the algorithm is better than 20 ms. This accuracy is determined with respect to the perceptual moment of occurrence of a vowel onset as determined by a phonetician.
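The underlying assumption can be illustrated with a simple peak-picking sketch. It is not the algorithm of the paper. It assumes that a per-frame track of resonance-peak strength has already been computed from the amplitude spectrum, and it marks an onset wherever the frame-to-frame rise of that track reaches a local maximum above a threshold. The function name, the strength track, the 10 ms frame step, the threshold, and the minimum gap between onsets are all hypothetical.

```python
def detect_onsets(strength, frame_ms=10.0, threshold=1.0, min_gap_ms=100.0):
    """Return onset times in ms from a per-frame peak-strength track.

    An onset is reported at the later frame of each pair of frames whose
    rise is a local maximum of at least `threshold`, provided it is at
    least `min_gap_ms` after the previous reported onset.
    """
    rise = [strength[i + 1] - strength[i] for i in range(len(strength) - 1)]
    onsets = []
    for i, r in enumerate(rise):
        left = rise[i - 1] if i > 0 else float("-inf")
        right = rise[i + 1] if i + 1 < len(rise) else float("-inf")
        is_local_max = r >= left and r > right
        time_ms = (i + 1) * frame_ms
        far_enough = not onsets or time_ms - onsets[-1] >= min_gap_ms
        if is_local_max and r >= threshold and far_enough:
            onsets.append(time_ms)
    return onsets

# Hypothetical strength track with two rapid rises.
track = [0, 0, 0, 1, 4, 6, 6, 6, 5, 3, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 2, 5, 6, 6]
onset_times = detect_onsets(track)
print(onset_times)
```

Raising the threshold or the minimum gap trades false alarms against missed detections, which is the kind of parameter adjustment the abstract mentions for isolated words.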