Abstract: We study the cocktail-party effect, which refers to the ability of a listener to attend to a single talker in the presence of adverse acoustical conditions. It has been observed that this ability improves in the presence of binaural cues. In this paper, we explore a technique for speech segregation based on sound localization cues. The auditory masking phenomenon motivates an "ideal" binary mask in which time-frequency regions that correspond to the weak signal are canceled. In our model we estimate this binar…
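The "ideal" binary mask this abstract motivates is commonly defined per time-frequency (T-F) unit by comparing target and interference energy against a local SNR criterion. A minimal sketch (the function name and the 0 dB criterion are illustrative; the abstract is truncated before the paper's own estimation details):

```python
import numpy as np

def ideal_binary_mask(target_tf, interf_tf, lc_db=0.0):
    """Ideal binary mask: a T-F unit is kept (1) when the target's local
    energy exceeds the interference's by the local criterion lc_db,
    and canceled (0) otherwise."""
    # small floor avoids log of zero in silent units
    snr_db = 20.0 * np.log10((np.abs(target_tf) + 1e-12) /
                             (np.abs(interf_tf) + 1e-12))
    return (snr_db > lc_db).astype(float)

# toy magnitudes for 2 frames x 3 frequency channels
target = np.array([[1.0, 0.1, 0.5],
                   [0.2, 0.9, 0.5]])
interf = np.array([[0.1, 1.0, 0.5],
                   [0.9, 0.1, 0.5]])
mask = ideal_binary_mask(target, interf)
# mask: [[1, 0, 0], [0, 1, 0]] -- equal-energy units fall below the 0 dB criterion
```

Applying the mask to the mixture's T-F representation and resynthesising yields the segregated target; with a strict inequality, units where neither source dominates are canceled.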
“…The processes underlying spatial hearing can be used for the segregation of speech by increasing its SNR [6]. Part of our future work will be directed towards the enhancement of speech recognition systems with the aid of SSL.…”
Section: Discussion (mentioning)
confidence: 99%
“…Sounds can provide information comparable to visual stimuli in scenarios where vision is impeded. SSL can help robots to cope with environment hazards and to communicate [6]. A meta-objective of artificial SSL systems is their portability to different robots.…”
Abstract. This paper presents a spiking neural network (SNN) for binaural sound source localisation (SSL). The cues used for SSL were the interaural time (ITD) and level (ILD) differences. ITDs and ILDs were extracted with models of the medial superior olive (MSO) and the lateral superior olive (LSO). The MSO and LSO outputs were integrated in a model of the inferior colliculus (IC). The connection weights from the MSO and LSO neurons to the IC neurons were estimated using Bayesian inference. This inference process allowed the algorithm to perform robustly on a robot with ∼40 dB of ego-noise. The results showed that the algorithm is capable of differentiating sounds with an accuracy of 15°.
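The MSO is modelled in this line of work with spiking coincidence detectors. As a rough non-spiking analogue (not the paper's SNN), the ITD cue can be estimated as the lag that maximises the cross-correlation of the two ear signals:

```python
import numpy as np

def estimate_itd(left, right, fs):
    """Estimate the interaural time difference (seconds) as the peak lag
    of the cross-correlation; positive values mean the left ear leads."""
    corr = np.correlate(right, left, mode="full")
    lag = np.argmax(corr) - (len(left) - 1)
    return lag / fs

fs = 16000
t = np.arange(0, 0.02, 1.0 / fs)
tone = np.sin(2 * np.pi * 500 * t)
delay = 8                                        # samples, i.e. 0.5 ms
left = np.concatenate([tone, np.zeros(delay)])   # source on the left:
right = np.concatenate([np.zeros(delay), tone])  # right ear hears it late
itd = estimate_itd(left, right, fs)              # 8 / 16000 = 0.5 ms
```

For periodic signals the cross-correlation peak is ambiguous once the true lag approaches the signal period, which is one reason the paper fuses ITDs with ILDs (via the LSO model) before the IC stage.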
“…SNR values for the separated target speech also indicate good separation, and informal listening tests found that target speech extracted by the system was of good quality. SNR performance reported here (10.03 dB at the smallest separation) also compares well with those of [9], although direct comparison is difficult due to differing stimuli and spatial separations. The energy-based mechanism allowing unvoiced segments to be represented in the RTNN binary mask successfully included the utterances' fricatives.…”
Section: Discussion (supporting)
confidence: 65%
“…Thus, across-frequency grouping by ITD ought to provide a powerful mechanism for segregating multiple voices. Indeed, across-frequency grouping by ITD has been employed by computational models of voice separation (e.g., [8,9]). …”
Abstract. A speech separation system is described in which sources are represented in a joint interaural time difference-fundamental frequency (ITD-F0) cue space. Traditionally, recurrent timing neural networks (RTNNs) have been used only to extract periodicity information; in this study, this type of network is extended in two ways. Firstly, a coincidence detector layer is introduced, each node of which is tuned to a particular ITD; secondly, the RTNN is extended to become two-dimensional to allow periodicity analysis to be performed at each best-ITD. Thus, one axis of the RTNN represents F0 and the other ITD, allowing sources to be segregated on the basis of their separation in ITD-F0 space. Source segregation is performed within individual frequency channels without recourse to across-channel estimates of F0 or ITD that are commonly used in auditory scene analysis approaches. The system is evaluated on spatialised speech signals using energy-based metrics and automatic speech recognition.
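The two-dimensional ITD-F0 idea can be sketched without a spiking network: fix a candidate ITD with a coincidence product of the (half-wave rectified) ear signals, then run a periodicity analysis on that product, giving a map whose axes are ITD and F0 period. This is a simplified non-spiking stand-in for the RTNN; all names and parameter values below are illustrative:

```python
import numpy as np

def itd_f0_map(left, right, max_itd=8, min_period=20, max_period=60):
    """Joint ITD-periodicity map. Rows index candidate ITDs in samples
    (coincidence products), columns index autocorrelation lags
    (candidate F0 periods). min_period skips the trivially high
    short-lag autocorrelation values."""
    # auditory-nerve-like half-wave rectification
    l, r = np.maximum(left, 0.0), np.maximum(right, 0.0)
    n = len(l)
    amap = np.zeros((2 * max_itd + 1, max_period - min_period + 1))
    for i, d in enumerate(range(-max_itd, max_itd + 1)):
        # coincidence detector layer: align the ears at best-ITD d
        coinc = l[d:] * r[:n - d] if d >= 0 else l[:n + d] * r[-d:]
        # periodicity analysis of the coincidence output
        for j, p in enumerate(range(min_period, max_period + 1)):
            amap[i, j] = np.dot(coinc[:-p], coinc[p:])
    return amap

fs = 8000
samples = np.arange(800)
base = np.sin(2 * np.pi * 200 * samples / fs)  # 200 Hz: period 40 samples
left, right = np.roll(base, 4), base           # left ear 4 samples late
amap = itd_f0_map(left, right)
i, j = np.unravel_index(np.argmax(amap), amap.shape)
# i - max_itd recovers the imposed 4-sample ITD; j + min_period the 40-sample period
```

A source then appears as a peak in this space, so two talkers separated in either ITD or F0 occupy distinct regions, which is the basis for the segregation described above.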
“…Beamforming attempts to improve the SNR of a source using directional information [3,8]. Other approaches perform a time-frequency decomposition of the mixture signals and use between-channel level and time delay differences in each time-frequency (T-F) unit to estimate an output signal that originates from a particular direction [8,12,14,18]. These systems use localization information as a primary cue to achieve source segregation, and show rapid performance degradation as reverberation is added to the recordings.…”
Approaches to binaural and stereo speech segregation have often assumed that localization information can be used as a primary cue to achieve segregation of a target signal. Results produced by these systems degrade significantly in the presence of room reverberation. In this work, we present an alternative framework to achieve localization of groups of time-frequency units. We show that grouping across time and frequency allows the use of localization as an important cue for sequential grouping of time-frequency objects. We analyze the level of time-frequency grouping needed to achieve accurate object localization and show preliminary binaural segregation results using the proposed framework. Results indicate that both localization and segregation performance can be improved by grouping across time and frequency.
Index Terms - Binaural sound localization, speech segregation, reverberation, computational auditory scene analysis.
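The per-unit localization these abstracts describe typically starts from the interaural phase difference (IPD) of an STFT decomposition, converted to a time delay per T-F unit. A sketch under illustrative assumptions (frame/hop sizes and the direct IPD-to-delay conversion are not from any specific cited system):

```python
import numpy as np

def tf_unit_delays(left, right, fs, frame=256, hop=128):
    """Per-T-F-unit time delay (seconds) from the interaural phase
    difference of an STFT decomposition."""
    win = np.hanning(frame)
    freqs = np.fft.rfftfreq(frame, 1.0 / fs)
    n_frames = 1 + (len(left) - frame) // hop
    delays = np.zeros((n_frames, len(freqs)))
    for t in range(n_frames):
        lspec = np.fft.rfft(win * left[t * hop:t * hop + frame])
        rspec = np.fft.rfft(win * right[t * hop:t * hop + frame])
        ipd = np.angle(lspec * np.conj(rspec))  # interaural phase difference
        with np.errstate(divide="ignore", invalid="ignore"):
            delays[t] = ipd / (2 * np.pi * freqs)  # DC bin -> NaN/inf
    return delays

fs = 16000
samples = np.arange(1024)
sig = np.sin(2 * np.pi * 500 * samples / fs)
left, right = sig, np.roll(sig, 4)   # right ear 4 samples (0.25 ms) late
delays = tf_unit_delays(left, right, fs)
# at the 500 Hz bin (index 8) the estimate is ~4/16000 s
```

Phase wrapping makes the per-bin conversion ambiguous above roughly 1/(2·ITD) Hz, and reverberation corrupts individual units, which motivates the paper's move from per-unit decisions to localization of grouped T-F objects.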