The "cocktail party problem" was studied using virtual stimuli whose spatial locations were generated using anechoic head-related impulse responses from the AUDIS database [Blauert et al., J. Acoust. Soc. Am. 103, 3082 (1998)]. Speech reception thresholds (SRTs) were measured for Harvard IEEE sentences presented from the front in the presence of one, two, or three interfering sources. Four types of interferer were used: (1) other sentences spoken by the same talker, (2) time-reversed sentences of the same talker, (3) speech-spectrum shaped noise, and (4) speech-spectrum shaped noise, modulated by the temporal envelope of the sentences. Each interferer was matched to the spectrum of the target talker. Interferers were placed in several spatial configurations, either coincident with or separated from the target. Binaural advantage was derived by subtracting SRTs from listening with the "better monaural ear" from those for binaural listening. For a single interferer, there was a binaural advantage of 2-4 dB for all interferer types. For two or three interferers, the advantage was 2-4 dB for noise and speech-modulated noise, and 6-7 dB for speech and time-reversed speech. These data suggest that the benefit of binaural hearing for speech intelligibility is especially pronounced when there are multiple voiced interferers at different locations from the target, regardless of spatial configuration; measurements with fewer or with other types of interferers can underestimate this benefit.
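The binaural-advantage derivation described above is a simple difference of thresholds, and can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name, the argument names, and the SRT values are hypothetical, and the sign convention (better-ear SRT minus binaural SRT, so that a benefit comes out positive) is an assumption chosen to match the positive advantages reported in the abstract.

```python
def binaural_advantage(srt_left_db, srt_right_db, srt_binaural_db):
    """Return the binaural advantage in dB (positive = binaural listening helps).

    Assumed convention: a lower SRT means better performance, so the
    "better monaural ear" is the ear with the lower (minimum) SRT.
    """
    better_ear_db = min(srt_left_db, srt_right_db)
    return better_ear_db - srt_binaural_db


# Hypothetical SRTs in dB, invented for illustration only (not data from the study).
advantage_db = binaural_advantage(srt_left_db=-4.0, srt_right_db=-7.0, srt_binaural_db=-10.0)
print(f"Binaural advantage: {advantage_db:.1f} dB")
```

With these made-up thresholds the better ear is the right ear (-7 dB), and the binaural SRT is 3 dB lower than that.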
Six experiments explored why the identification of the two members of a pair of diotic, simultaneous, steady-state vowels improves with a difference in fundamental frequency (delta F0). Experiment 1 confirmed earlier reports that a delta F0 improves identification of 200-ms but not 50-ms duration "double vowels"; identification improves up to 1 semitone delta F0 and then asymptotes. In such stimuli, all the formants of a given vowel are excited by the same F0, providing listeners with a potential grouping cue. Subsequent experiments asked whether the improvement in identification with delta F0 for the longer vowels was due to listeners using the consistent F0 within each vowel of a pair to group formants appropriately. Individual vowels were synthesized with one F0 in the region of the first-formant peak and a different F0 in the region of the higher-formant peaks. Such vowels were then paired so that the first formant of one vowel bore the same F0 as the higher formants of the other vowel. These across-formant inconsistencies in F0 did not substantially reduce the previous improvement in identification rates with increasing delta F0's of up to 4 semitones (experiment 2). The subjects' improvement with increasing delta F0 in the inconsistent condition was not produced by identifying vowels on the basis of information in the first-formant or higher-formant regions alone, since stimuli which contained either of these regions in isolation were difficult for subjects to identify. In addition, the inconsistent condition did produce poorer identification for larger delta F0's (experiment 3). The improvement in identification with delta F0 found for the inconsistent stimuli persisted when the delta F0 between vowel pairs was confined to the first-formant region (experiment 4) but not when it was confined to the higher formants (experiment 6). The results replicate at different overall presentation levels (experiment 5). The experiments show that at small delta F0's only the first-formant region contributes to improvements in identification accuracy, whereas with larger delta F0's the higher-formant region may also contribute. This difference may be related to other results that demonstrate the superiority of resolved rather than unresolved harmonics in coding pitch.
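For readers less used to thinking in semitones, the delta F0 values quoted above can be converted to frequencies with the standard equal-tempered relation, in which one semitone corresponds to a frequency ratio of 2^(1/12). The sketch below is illustrative only: the function name and the 100 Hz baseline F0 are assumptions, not values taken from the study.

```python
def shift_by_semitones(f0_hz, semitones):
    """Return the frequency that lies `semitones` above `f0_hz`.

    Uses the equal-tempered relation: 12 semitones double the frequency.
    """
    return f0_hz * 2.0 ** (semitones / 12.0)


# Hypothetical baseline F0 of 100 Hz, chosen only to make the arithmetic easy to read.
baseline_hz = 100.0
for delta_semitones in (1, 4):
    shifted_hz = shift_by_semitones(baseline_hz, delta_semitones)
    print(f"delta F0 = {delta_semitones} semitone(s): {baseline_hz:.1f} Hz vs {shifted_hz:.2f} Hz")
```

On this assumed baseline, a 1-semitone delta F0 is a difference of roughly 6 Hz, and a 4-semitone delta F0 is a difference of roughly 26 Hz.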
Four experiments investigated the effect of the fundamental frequency (F0) contour on speech intelligibility against interfering sounds. Speech reception thresholds (SRTs) were measured for sentences with different manipulations of their F0 contours. These manipulations involved either reductions in F0 variation, or complete inversion of the F0 contour. Against speech-shaped noise, a flattened F0 contour had no significant impact on SRTs compared to a normal F0 contour; the mean SRT for the flattened contour was only 0.4 dB higher. The mean SRT for the inverted contour, however, was 1.3 dB higher than for the normal F0 contour. When the sentences were played against a single-talker interferer, the overall effect was greater, with a 2.0 dB difference between normal and flattened conditions, and 3.8 dB between normal and inverted. There was no effect of altering the F0 contour of the interferer, indicating that any abnormality of the F0 contour serves to reduce intelligibility of the target speech, but does not alter the masking produced by interfering speech. Low-pass filtering the F0 contour increased SRTs; elimination of frequencies between 2 and 4 Hz had the greatest effect. Filtering sentences with inverted contours did not have a significant effect on SRTs.
Three experiments investigated the roles of interaural time differences (ITDs) and level differences (ILDs) in spatial unmasking in multi-source environments. In experiment 1, speech reception thresholds (SRTs) were measured in virtual-acoustic simulations of an anechoic environment with three interfering sound sources of either speech or noise. The target source lay directly ahead, while three interfering sources were (1) all at the target's location (0°, 0°, 0°), (2) at locations distributed across both hemifields (-30°, 60°, 90°), (3) at locations in the same hemifield (30°, 60°, 90°), or (4) co-located in one hemifield (90°, 90°, 90°). Sounds were convolved with head-related impulse responses (HRIRs) that were manipulated to remove individual binaural cues. Three conditions used HRIRs with (1) both ILDs and ITDs, (2) only ILDs, and (3) only ITDs. The ITD-only condition produced the same pattern of results across spatial configurations as the combined cues, but with smaller differences between spatial configurations. The ILD-only condition yielded similar SRTs for the (-30°, 60°, 90°) and (0°, 0°, 0°) configurations, as expected for best-ear listening. In experiment 2, pure-tone binaural masking level differences (BMLDs) were measured at third-octave frequencies against the ITD-only, speech-shaped noise interferers of experiment 1. These BMLDs were 4-8 dB at low frequencies for all spatial configurations. In experiment 3, SRTs were measured for speech in diotic, speech-shaped noise. Noises were filtered to reduce the spectrum level at each frequency according to the BMLDs measured in experiment 2. SRTs were as low as or lower than those of the corresponding ITD-only conditions from experiment 1. Thus, an explanation of speech understanding in complex listening environments based on the combination of best-ear listening and binaural unmasking (without involving sound localization) cannot be excluded.
In the presence of competing speech or noise, reverberation degrades speech intelligibility not only by its direct effect on the target but also by affecting the interferer. Two experiments were designed to validate a method for predicting the loss of intelligibility associated with this latter effect. Speech reception thresholds were measured under headphones, using spatially separated target sentences and speech-shaped noise interferers simulated in virtual rooms. To investigate the effect of reverberation on the interferer unambiguously, the target was always anechoic. The interferer was placed in rooms with different sizes and absorptions, and at different distances and azimuths from the listener. The interaural coherence of the interferer did not fully predict the effect of reverberation. The azimuth separation of the sources and the coloration introduced by the room also had to be taken into account. The binaural effects were modeled by computing the binaural masking level differences in the studied configurations, the monaural effects were predicted from the excitation pattern of the noises, and speech intelligibility index weightings were applied to both. These parameters were all calculated from the room impulse responses convolved with noise. A 0.95-0.97 correlation was obtained between the speech reception thresholds and their predicted value.
Three experiments and a computational model explored the role of within-channel and across-channel processes in the perceptual separation of competing, complex, broadband sounds which differed in their interaural phase spectra. In each experiment, two competing vowels, whose first and second formants were represented by two discrete bands of noise, were presented concurrently, for identification. Experiments 1 and 2 showed that listeners were able to identify the vowels accurately when each was presented to a different ear, but were unable to identify the vowels when they were presented with different interaural time delays (ITDs); i.e. listeners could not group the noisebands in different frequency regions with the same ITD and thereby separate them from bands in other frequency regions with a different ITD. Experiment 3 demonstrated that while listeners were unable to exploit a difference in interaural delay between the pairs of noisebands, listeners could identify a vowel defined by interaurally decorrelated noisebands when the other two noisebands were interaurally correlated. A computational model based upon that of Durlach [J. Acoust. Soc. Am. 32, 1075-1076 (1960)] showed that the results of these and other experiments can be interpreted in terms of a within-channel mechanism, which is sensitive to interaural decorrelation. Thus the across-frequency integration which occurs in the lateralization of complex sounds may play little role in segregating concurrent sounds.
Two experiments investigated the effect of reverberation on listeners' ability to perceptually segregate two competing voices. Culling et al. [Speech Commun. 14, 71-96 (1994)] found that for competing synthetic vowels, masked identification thresholds were increased by reverberation only when combined with modulation of fundamental frequency (F0). The present investigation extended this finding to running speech. Speech reception thresholds (SRTs) were measured for a male voice against a single interfering female voice within a virtual room with controlled reverberation. The two voices were either (1) co-located in virtual space at 0° azimuth or (2) separately located at ±60° azimuth. In experiment 1, target and interfering voices were either normally intonated or resynthesized with a fixed F0. In anechoic conditions, SRTs were lower for normally intonated and for spatially separated sources, while, in reverberant conditions, the SRTs were all the same. In experiment 2, additional conditions employed inverted F0 contours. Inverted F0 contours yielded higher SRTs in all conditions, regardless of reverberation. The results suggest that reverberation can seriously impair listeners' ability to exploit differences in F0 and spatial location between competing voices. The levels of reverberation employed had no effect on speech intelligibility in quiet.