Humans and other animals can attend to one of multiple sounds, and follow it selectively over time. The neural underpinnings of this perceptual feat remain mysterious. Some studies have concluded that sounds are heard as separate streams when they activate well-separated populations of central auditory neurons, and that this process is largely pre-attentive. Here, we argue instead that stream formation depends primarily on temporal coherence between responses that encode various features of a sound source. Furthermore, we postulate that only when attention is directed towards a particular feature (e.g., pitch) do all other temporally coherent features of that source (e.g., timbre and location) become bound together as a stream that is segregated from the incoherent features of other sources.
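As a rough illustration of what "temporal coherence" between feature channels could mean, the toy sketch below compares the response envelopes of two hypothetical channels when their tones are presented synchronously versus in alternation. It is not the authors' model: the binary envelopes, the function names (`envelope`, `coherence`), the timing constants, and the use of zero-lag Pearson correlation as the coherence measure are all assumptions made for this example.

```python
from math import sqrt


def envelope(onsets, tone_len, total_len):
    """Return a 0/1 envelope with a tone of tone_len bins at each onset (hypothetical helper)."""
    env = [0.0] * total_len
    for start in onsets:
        for t in range(start, min(start + tone_len, total_len)):
            env[t] = 1.0
    return env


def coherence(x, y):
    """Zero-lag Pearson correlation between two equal-length envelopes.

    Assumption: correlation stands in for 'temporal coherence' in this sketch.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Assumed timing, in arbitrary time bins: 5-bin tones repeating every 20 bins.
TOTAL, TONE, PERIOD = 80, 5, 20

chan_a = envelope(range(0, TOTAL, PERIOD), TONE, TOTAL)
chan_b_sync = envelope(range(0, TOTAL, PERIOD), TONE, TOTAL)           # same onsets as A
chan_b_alt = envelope(range(PERIOD // 2, TOTAL, PERIOD), TONE, TOTAL)  # onsets between A's tones

sync_coherence = coherence(chan_a, chan_b_sync)
alt_coherence = coherence(chan_a, chan_b_alt)

print("synchronous:", sync_coherence)
print("alternating:", alt_coherence)
```

In this toy setting the synchronous channels are perfectly correlated while the alternating channels are anti-correlated, regardless of how far apart the two channels are assumed to lie along the feature axis. That is the contrast the coherence-based account relies on, as opposed to an account based only on the separation of the activated populations.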
© 2010 Elsevier Ltd. All rights reserved. Corresponding author: Shihab Shamma, Electrical and Computer Engineering and Institute for Systems Research, University of Maryland, College Park, MD 20742, Tel: 301-405-6842, sas@umd.edu.

The auditory "scene analysis" problem

Humans and other animals routinely detect, identify, and track sounds coming from a particular source (e.g., someone's voice, a conspecific call) amid sounds emanating from other sources (e.g., other voices, heterospecific calls, ambient music, or street traffic) (Figure 1). The apparent ease with which they determine which components and attributes in a sound mixture arise from the same source belies the complexity of the underlying biological processes. By analogy with the "scene segmentation" problem in vision, this is referred to as the "auditory scene analysis" problem [1] (Glossary) or, more colloquially, the "cocktail party" problem [2-4]. Understanding how the brain solves this problem is a fundamental challenge for auditory science: doing so should shed light on the difficulties that hearing-impaired listeners face in multi-source environments [9], and lead to more effective front-ends for auditory prostheses and automatic speech recognition [10].

Recent studies have inspired numerous hypotheses and models concerning the neural underpinnings of perceptual organization in the central auditory system, and especially the auditory cortex (see [3,7,8,11-20] for reviews). One prominent hypothesis that underlies most investigations is that sound elements segregate into separate "streams" whenever they activate well-separated populations of auditory neurons that are selective for frequency or for any other sound attribute that has been shown to support stream segregation [21-30]. We shall