This study introduces a model for solving three different auditory tasks in a multi-talker setting: target localization, target identification, and word recognition. The model was used to simulate psychoacoustic data from a call-sign-based listening test involving multiple spatially separated talkers [Brungart and Simpson (2007). Percept. Psychophys. 69(1), 79–91]. The main characteristics of the model are (i) the extraction of salient auditory features (“glimpses”) from the multi-talker signal and (ii) the use of a classification method that finds the best target hypothesis by comparing feature templates from clean target signals to the glimpses derived from the multi-talker mixture. The four features used were periodicity, periodic energy, and periodicity-based interaural time and level differences. The model results exceeded chance level by a wide margin for all subtasks and conditions, and generally agreed closely with the subject data. This indicates that, despite their sparsity, glimpses provide sufficient information about a complex auditory scene. This also suggests that complex source superposition models may not be needed for auditory scene analysis. Instead, simple models of clean speech may be sufficient to decode even complex multi-talker scenes.
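The template-matching step described in (ii) can be illustrated with a small sketch: each clean-speech template is scored against the observed features only at the glimpsed time-frequency units, and the best-scoring hypothesis wins. All feature values, labels, and array shapes below are invented for illustration; this is not the authors' implementation.

```python
import numpy as np

def classify_glimpses(glimpse_feats, glimpse_mask, templates):
    """Pick the clean-speech template that best matches the features
    observed at the glimpsed time-frequency units only (a minimal
    sketch of the template-matching idea; names are illustrative)."""
    scores = {}
    for label, tmpl in templates.items():
        err = (glimpse_feats - tmpl) ** 2
        scores[label] = err[glimpse_mask].mean()  # distance on glimpses only
    return min(scores, key=scores.get)

# Toy example: 2 feature channels x 4 time frames; roughly half the
# units are "glimpsed" (reliable); the labels are invented call signs.
templates = {
    "ready": np.array([[1.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 1.0]]),
    "baron": np.array([[0.0, 1.0, 1.0, 0.0],
                       [1.0, 0.0, 0.0, 1.0]]),
}
observed = np.array([[0.9, 1.1, 0.0, 0.2],
                     [0.1, 0.0, 0.9, 1.0]])
mask = np.array([[True, True, False, True],
                 [True, False, True, True]])
best = classify_glimpses(observed, mask, templates)  # → "ready"
```

Restricting the distance to the glimpsed units is the key point: the masked-out units, which are dominated by interferers, never enter the comparison.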
Human listeners robustly decode speech information from a talker of interest that is embedded in a mixture of spatially distributed interferers. A relevant question is which time-frequency segments of the speech are predominantly used by a listener to solve such a complex auditory scene analysis task. A recent psychoacoustic study investigated the relevance of low signal-to-noise ratio (SNR) components of a target signal on speech intelligibility in a spatial multi-talker situation. For this, a three-talker stimulus was manipulated in the spectro-temporal domain such that target speech time-frequency units below a variable SNR criterion were discarded while keeping the interferers unchanged. The psychoacoustic data indicate that only target components at and above a local SNR of about 0 dB contribute to intelligibility. This study applies an auditory scene analysis "glimpsing" model to the same manipulated stimuli. Model data are found to be similar to the human data, supporting the notion of "glimpsing," that is, that salient speech-related information is predominantly used by the auditory system to decode speech embedded in a mixture of sounds, at least for the tested conditions of three overlapping speech signals. This implies that perceptually relevant auditory information is sparse and may be processed with low computational effort, which is relevant for neurophysiological research of scene analysis and novelty processing in the auditory system.
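The stimulus manipulation described above can be sketched as a binary mask on the target's time-frequency representation: units whose local SNR falls below the criterion are discarded before remixing with the unchanged interferers. This is a minimal illustration assuming separate access to target and interferer magnitudes; the variable names and toy values are invented.

```python
import numpy as np

def apply_snr_criterion(target_tf, interferer_tf, snr_crit_db=0.0):
    """Discard target time-frequency units whose local SNR is below
    snr_crit_db, leaving the interferers untouched (a sketch of the
    stimulus manipulation, not the study's actual processing chain)."""
    eps = np.finfo(float).eps
    local_snr_db = 10.0 * np.log10((target_tf ** 2 + eps) /
                                   (interferer_tf ** 2 + eps))
    mask = local_snr_db >= snr_crit_db       # keep only "glimpsed" units
    return target_tf * mask + interferer_tf  # remix with unchanged maskers

# toy magnitudes: 2 frequency bands x 3 time frames
target = np.array([[1.0, 0.1, 2.0],
                   [0.5, 3.0, 0.2]])
masker = np.array([[0.5, 1.0, 1.0],
                   [1.0, 1.0, 1.0]])
mixture = apply_snr_criterion(target, masker, snr_crit_db=0.0)
```

With the 0-dB criterion of the study, only units where the target magnitude is at least as large as the interferer magnitude survive into the mixture.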
This study investigated the influence of high-frequency cue bands on the detection and discrimination of low-frequency target bands presented in a 3000-Hz low-pass noise masker. Target and cue bands were complex tones with 80-Hz spacing. The cue band consisted of 60 components starting at 4000 Hz; targets consisted of four components starting at different frequencies (500, 700, 1000, 1200, or 1500 Hz). Targets were presented with different durations within the 500-ms masker; target and cue bands had a common on- and offset. Presentation of the high-frequency complex tone significantly improved both discrimination and detection thresholds by 2–3 dB.
A recent study showed that human listeners are able to localize a short speech target simultaneously masked by four speech tokens in reverberation [Kopčo, Best, and Carlile (2010). J. Acoust. Soc. Am. 127, 1450–1457]. Here, an auditory model for solving this task is introduced. The model has three processing stages: (1) extraction of the instantaneous interaural time difference (ITD) information, (2) selection of target-related ITD information ("glimpses") using a template-matching procedure based on periodicity, spectral energy, or both, and (3) target location estimation. The model performance was compared to the human data, and to the performance of a modified model using an ideal binary mask (IBM) at stage (2). The IBM-based model performed similarly to the subjects, indicating that the binaural model is able to accurately estimate source locations. Template matching using spectral energy and using a combination of spectral energy and periodicity achieved good results, while using periodicity alone led to poor results. In particular, the glimpses extracted from the initial portion of the signal were critical for good performance. Simulation data show that the auditory features investigated here are sufficient to explain human performance in this challenging listening condition and thus may be used in models of auditory scene analysis.
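Stage (1), the extraction of instantaneous ITD information, is commonly realized via interaural cross-correlation: the ITD estimate is the interaural lag at which the two ear signals correlate best. The following sketch shows one generic way to do this; the function name, parameters, and sign convention are illustrative, not the model's actual code.

```python
import numpy as np

def estimate_itd(left, right, fs, max_itd_s=1e-3):
    """Estimate the ITD of a binaural frame as the interaural
    cross-correlation lag with maximal correlation (a generic
    stand-in for stage 1 of the model described above)."""
    max_lag = int(round(max_itd_s * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    # corr[l] = sum_n left[n] * right[n + l], evaluated for each lag l
    corr = [np.sum(left[max(0, -l):len(left) - max(0, l)] *
                   right[max(0, l):len(right) - max(0, -l)])
            for l in lags]
    return lags[int(np.argmax(corr))] / fs  # ITD in seconds

fs = 48000
rng = np.random.default_rng(0)
sig = rng.standard_normal(4800)             # 100 ms of broadband noise
delay = 24                                  # 24 samples = 0.5 ms at 48 kHz
left = sig
right = np.concatenate([np.zeros(delay), sig[:-delay]])  # right ear delayed
itd = estimate_itd(left, right, fs)         # recovers the 0.5-ms delay
```

In a full model this estimate would be computed per auditory filter channel and short time frame, yielding the instantaneous ITD map from which stage (2) selects the target-related glimpses.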
A temporally acute binaural system can help to resolve inherent fluctuations in binaural information that are often present in complex auditory scenes. Using a broadband noise stimulus that rapidly alternates between two different values of interaural time difference (ITD), the ability of the binaural system to hear the lateral position resulting from one of the ITD values was investigated. Results show that listeners are able to accurately lateralize brief noise tokens of only 3-7 ms in duration. In two subsequent experiments, the role of an amplitude modulation (AM) imposed on the ITD-switching stimulus used in the first experiment was tested. For wideband stimuli, the temporal position of the ITD target relative to the phase of the AM did not influence absolute lateralization or detection performance. When the stimuli were narrowband, however, detection of the ITD target was best when temporally positioned in the rising portion of the AM. These experiments illustrate that the auditory system is capable of making accurate lateral estimates of very brief moments of ITD information. Furthermore, for these instantaneous changes in ITD information, the stimulus bandwidth can influence the role of envelope cues for the readout of binaural information.
For realistic listening conditions, interaural cues will fluctuate due to the presence of multiple active sources. If it is assumed that the binaural system is sluggish, then the perceived location of the sound input would be an average of the varying interaural cues. If, however, the binaural system is fast enough to assess the rapidly changing interaural differences, then it could be possible for the binaural system to properly identify the spatial position of a target source. Using a continuous, broadband noise stimulus that contained periodically alternating interaural time differences (ITDs) and, notably, no monaural cues, we investigated the binaural system's ability to lateralize brief durations of the target ITD. Results show that listeners can lateralize targets for durations of 3–6 ms, indicating that the binaural system allows for a segregation and lateralization of the target and interfering noise streams. Furthermore, results indicate that the binaural system mediates the buildup of a modulated stream. A second experiment investigating whether the salience of the target ITD in the aforementioned stimulus depends on the temporal position of the target within the phase of an amplitude modulated envelope revealed that this was not the case.
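One generic way to construct such an alternating-ITD stimulus is to draw a single noise carrier and re-index it segment-wise with alternating delays, so that only the interaural timing changes between segments. This is a sketch under invented parameter names, not the published stimulus code; in particular, it ignores the segment-boundary discontinuities that the actual experiment would need to control to guarantee the absence of monaural cues.

```python
import numpy as np

def alternating_itd_noise(dur_s, fs, itd_a_s, itd_b_s, seg_s, seed=1):
    """Broadband noise whose ITD alternates between itd_a_s and itd_b_s
    every seg_s seconds; both ears read from the same noise carrier,
    only the interaural delay changes (illustrative sketch)."""
    n = int(dur_s * fs)
    pad = int(np.ceil(max(abs(itd_a_s), abs(itd_b_s)) * fs))
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n + 2 * pad)    # shared carrier, padded
    left = noise[pad:pad + n]
    right = np.empty(n)
    seg = max(1, int(seg_s * fs))
    for start in range(0, n, seg):
        itd = itd_a_s if (start // seg) % 2 == 0 else itd_b_s
        d = int(round(itd * fs))                # delay in samples
        stop = min(start + seg, n)
        right[start:stop] = noise[pad - d + start:pad - d + stop]
    return left, right

fs = 48000
# 100 ms of noise, ITD alternating between 0 and 0.5 ms every 5 ms
left, right = alternating_itd_noise(0.1, fs, 0.0, 0.0005, 0.005)
```

Within each even segment the two channels are identical (diotic); within each odd segment the right channel is a 0.5-ms-delayed copy of the same carrier, which is the kind of brief ITD target the listeners had to lateralize.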
In many everyday situations, listeners are confronted with complex acoustic scenes. Despite the complexity of these scenes, they are able to follow and understand one particular talker. This contribution presents auditory models that aim to solve speech-related tasks in multi-talker settings. The main characteristics of the models are: (1) restriction to salient auditory features (“glimpses”); (2) usage of periodicity, periodic energy, and binaural features; and (3) template-based classification methods using clean speech models. Further classification approaches using state-space models will be discussed. The model performance is evaluated on the basis of human psychoacoustic data [e.g., Brungart and Simpson, Perception & Psychophysics, 2007, 69(1), 79–91; Schoenmaker and van de Par, Physiology, Psychoacoustics and Cognition in Normal and Impaired Hearing, 2016, 73–81]. The model results were mostly found to be similar to the subject results. This suggests that sparse glimpses of periodicity-related monaural and binaural auditory features provide sufficient information about a complex auditory scene involving multiple talkers. Furthermore, it can be concluded that the usage of clean speech models is sufficient to decode speech information from the glimpses derived from a complex scene, i.e., computationally complex models of sound source superposition are not required for decoding a speech stream.