Band-importance functions were created using the "compound" technique [Apoux and Healy, J. Acoust. Soc. Am. 132, 1078-1087 (2012)], which accounts for the multitude of synergistic and redundant interactions that take place among speech bands. Functions were created for standard recordings of the speech perception in noise (SPIN) sentences and the Central Institute for the Deaf (CID) W-22 words using 21 critical-band divisions and steep filtering to eliminate the influence of filter slopes. On a given trial, a band of interest was presented along with four other bands whose spectral locations were determined randomly on each trial. In corresponding trials, the band of interest was absent and only the four other bands were present. The importance of the band of interest was determined by the difference in performance between paired band-present and band-absent trials. Because the locations of the other bands changed randomly from trial to trial, various interactions occurred between the band of interest and the other speech bands, which provided a general estimate of band importance. The obtained band-importance functions differed substantially from those currently available for the same speech recordings. In addition to differences in the overall shape of the functions, especially for the W-22 words, a complex microstructure was observed in which the importance of adjacent frequency bands often varied considerably. This microstructure may give the current functions greater predictive power.
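For illustration, the paired-trial logic of the compound technique can be sketched in a few lines of Python. The score() callable below is a hypothetical stand-in for a listener's proportion correct given a set of audible bands; the trial count and the toy additive model are assumptions for the sketch, not the study's values.

```python
# Illustrative sketch of the compound technique's paired-trial scoring logic.
import numpy as np

def compound_importance(score, n_bands=21, n_trials=200, seed=0):
    """Estimate each band's importance as the mean paired difference between
    band-present and band-absent trials with four randomly located other bands."""
    rng = np.random.default_rng(seed)
    importance = np.zeros(n_bands)
    for band in range(n_bands):
        diffs = []
        for _ in range(n_trials):
            others = rng.choice([b for b in range(n_bands) if b != band],
                                size=4, replace=False)
            present = score(set(others) | {band})  # band-present trial
            absent = score(set(others))            # paired band-absent trial
            diffs.append(present - absent)
        importance[band] = np.mean(diffs)
    return importance

# Toy demonstration with a hypothetical additive intelligibility model.
true_weights = np.linspace(0.01, 0.05, 21)
estimate = compound_importance(lambda bands: sum(true_weights[b] for b in bands))
```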
The relative importance of temporal information in broad spectral regions for consonant identification was assessed in normal-hearing listeners. To force listeners to rely primarily on temporal-envelope cues, speech sounds were spectrally degraded using four-band noise-vocoder processing. Frequency-weighting functions were determined using two methods. The first consisted of measuring the intelligibility of speech with a hole in the spectrum, either in quiet or in noise. The second consisted of correlating performance with the signal-to-noise ratio, varied randomly and independently within each band. Results demonstrated that all bands contributed equally to consonant identification when presented in quiet. In noise, however, both methods indicated that listeners consistently placed relatively more weight on the highest frequency band. It is proposed that the difference in results between quiet and noise is explained by the shape of the modulation spectra in adjacent frequency bands. Overall, the results suggest that normal-hearing listeners use a common listening strategy in a given condition. However, this strategy may be influenced by competing sounds and thus may vary with context. Some implications of the results for cochlear implantees and hearing-impaired listeners are discussed.
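As an illustration of noise-band vocoder processing, a minimal sketch follows. The band edges, filter order, and Hilbert-based envelope extraction are assumptions for the sketch and need not match the study's exact parameters.

```python
# Minimal four-band noise-vocoder sketch (assumed parameters).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(speech, fs, edges=(80, 500, 1500, 3500, 7563)):
    """Keep each band's temporal envelope; replace its carrier with
    band-limited noise, then sum the bands."""
    rng = np.random.default_rng(0)
    out = np.zeros_like(speech, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, speech)
        env = np.abs(hilbert(band))                  # temporal envelope
        noise = sosfiltfilt(sos, rng.standard_normal(len(speech)))
        out += sosfiltfilt(sos, env * noise)         # refilter to limit sidebands
    return out
```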
The present study investigated the role and relative contribution of the temporal envelope and temporal fine structure (TFS) to sentence recognition in noise. Target and masker stimuli were mixed at five different signal-to-noise ratios (SNRs) and filtered into 30 contiguous frequency bands. The envelope and TFS were extracted from each band by Hilbert decomposition. The final stimuli consisted of the envelope of the target/masker mixture at x dB SNR combined with the TFS of the same target/masker mixture at y dB SNR. A first experiment showed a very limited contribution of TFS cues, indicating that sentence recognition in noise relies almost exclusively on temporal-envelope cues. A second experiment showed that replacing the carrier of a sound mixture with noise (vocoder processing) cannot be considered equivalent to disrupting the TFS of the target signal by adding a background noise. Accordingly, a re-evaluation of the vocoder approach as a model for understanding the role of TFS cues in noisy situations may be necessary. Overall, these data are consistent with the view that speech information is primarily extracted from the envelope, whereas TFS cues are primarily used to detect glimpses of the target.
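A minimal sketch of this envelope/TFS exchange for a single band, assuming SciPy's Hilbert transform, is given below; mix_at_snr, env_tfs, and hybrid_band are illustrative names, and the loop over the 30 bands is omitted.

```python
# Sketch of the Hilbert envelope/TFS swap for one frequency band.
import numpy as np
from scipy.signal import hilbert

def mix_at_snr(target, masker, snr_db):
    """Scale the masker so that the target/masker mixture sits at snr_db."""
    gain = np.sqrt(np.mean(target ** 2) / np.mean(masker ** 2)) / 10 ** (snr_db / 20)
    return target + gain * masker

def env_tfs(band):
    """Hilbert decomposition of a band-limited signal into envelope and TFS."""
    analytic = hilbert(band)
    return np.abs(analytic), np.cos(np.angle(analytic))

def hybrid_band(target_band, masker_band, x_db, y_db):
    """Envelope of the mixture at x dB SNR riding on TFS of the mixture at y dB SNR."""
    env, _ = env_tfs(mix_at_snr(target_band, masker_band, x_db))
    _, tfs = env_tfs(mix_at_snr(target_band, masker_band, y_db))
    return env * tfs
```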
Speech recognition in noise presumably relies on the number and spectral location of available auditory-filter outputs containing a relatively undistorted view of local target-signal properties. The purpose of the present study was to estimate the relative weight of each of the 30 auditory-filter-wide bands between 80 and 7563 Hz. Because previous approaches were not compatible with this goal, a new technique was developed. As in the "hole" approach, the weight of a given band was assessed by comparing intelligibility in two conditions differing in only one aspect: the presence or absence of the band of interest. In contrast to the hole approach, however, random gaps were also created in the spectrum. These gaps were introduced to render the auditory system more sensitive to the removal of a single band, and their locations were randomized to provide a general view of the weight of each band, i.e., irrespective of the location of information elsewhere in the spectrum. Frequency-weighting functions derived using this technique confirmed the main contribution of the 400-2500 Hz frequency region. However, they revealed a complex microstructure, contrasting with the "bell curve" shape typically reported.
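The trial construction behind this technique can be sketched as paired band masks that are identical except for the band of interest; the gap count below is an assumption, not the study's value.

```python
# Sketch of trial construction for the random-gap weighting technique.
import numpy as np

def make_trial_pair(band_of_interest, n_bands=30, n_gaps=10, seed=None):
    """Return paired band masks that differ only in the band of interest."""
    rng = np.random.default_rng(seed)
    mask = np.ones(n_bands, dtype=bool)
    candidates = [b for b in range(n_bands) if b != band_of_interest]
    mask[rng.choice(candidates, size=n_gaps, replace=False)] = False  # random gaps
    band_present, band_absent = mask.copy(), mask.copy()
    band_present[band_of_interest] = True    # band of interest retained
    band_absent[band_of_interest] = False    # band of interest removed
    return band_present, band_absent

# The band's weight is then estimated from the intelligibility difference
# across many such pairs, e.g. make_trial_pair(band_of_interest=12, seed=0).
```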
The number of auditory filter outputs required to identify phonemes was estimated in two experiments. Stimuli were divided into 30 contiguous bands, each one equivalent rectangular bandwidth (ERB_N) wide, spanning 80 to 7563 Hz. Normal-hearing listeners were presented with limited numbers of bands whose frequency locations were determined randomly from trial to trial, providing a general view, i.e., irrespective of specific band location, of the number of 1-ERB_N-wide speech bands needed to identify phonemes. The first experiment demonstrated that 20 such bands are required to accurately identify vowels, and 16 are required to identify consonants. In the second experiment, speech-shaped noise or time-reversed speech was introduced into the non-speech bands at various signal-to-noise ratios. Considerably elevated noise levels were necessary to substantially affect phoneme recognition, confirming a high degree of channel independence in the auditory system. The independence observed between auditory filter outputs supports current views of speech recognition in noise in which listeners extract and combine pieces of information randomly distributed both in time and frequency. These findings also suggest that the ability to partition incoming sounds into a large number of narrow bands, an ability often lost in cases of hearing impairment or cochlear implantation, is critical for speech recognition in noise.
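For reference, the 30-band grid can be reconstructed from the ERB-number scale of Glasberg and Moore (1990); the sketch below derives the band edges from the stated 80-7563 Hz span and is not the authors' code.

```python
# Reconstructing 30 one-ERB_N-wide band edges spanning 80-7563 Hz.
import numpy as np

def hz_to_erb_number(f_hz):
    """Frequency (Hz) -> ERB-number scale of Glasberg and Moore (1990)."""
    return 21.4 * np.log10(1.0 + 0.00437 * f_hz)

def erb_number_to_hz(erb):
    """ERB-number scale -> frequency (Hz)."""
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437

lo, hi = hz_to_erb_number(80.0), hz_to_erb_number(7563.0)
edges = erb_number_to_hz(np.linspace(lo, hi, 31))   # 31 edges -> 30 bands

# hi - lo evaluates to ~30, so each band is ~1 ERB_N wide, matching the text.
print(np.round(edges, 1))
```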