Abstract. The term auditory scene analysis (ASA) refers to the ability of human listeners to form perceptual representations of the constituent sources in an acoustic mixture, as in the well-known 'cocktail party' effect. Accordingly, computational auditory scene analysis (CASA) is the field of study which attempts to replicate ASA in machines. Some CASA systems are closely modelled on the known stages of auditory processing, whereas others adopt a more functional approach. However, all are broadly based on the principles underlying the perception and organisation of sound by human listeners, and in this respect they differ from ICA and other approaches to sound separation. In this paper, we review the principles underlying ASA and show how they can be implemented in CASA systems. We also consider the link between CASA and automatic speech recognition, and draw distinctions between the CASA and ICA approaches.

Introduction

Imagine a recording of a busy party, in which you can hear voices, music and other environmental sounds. How might a computational system process this recording in order to segregate the voice of a particular speaker from the other sources? Independent component analysis (ICA) offers one solution to this problem. However, it is not a solution that has much in common with that adopted by the best-performing sound separation system we know of: the human auditory system. Perhaps the key to building a sound separator that rivals human performance is to model human perceptual processing? This argument provides the motivation for the field of computational auditory scene analysis (CASA), which aims to build sound separation systems that adhere to the known principles of human hearing. In this chapter, we review the state of the art in CASA, and consider its similarities and differences with the ICA approach. We also consider the relationship between CASA and techniques for robust automatic speech recognition in noisy environments, and comment on the challenges facing this growing field of study.

Auditory Scene Analysis

In naturalistic listening situations, several sound sources are usually active at the same time, and the pressure variations in air that they generate combine to form a mixture at the ears of the listener. A common example is the situation in which the voices of two talkers overlap, as illustrated in Figure 16.1C. The figure shows the simulated auditory nerve response to a mixture of a male and a female voice, obtained from a computational model of auditory processing. How can this complex acoustic mixture be parsed in order to retrieve a description of one (or both) of the constituent sources? Bregman [5] was the first to present a coherent answer to this question (see also [17] for a more recent review). He contends that listeners perform an auditory scene analysis (ASA), which can be conceptualised as a two-stage process. In the first stage, the acoustic mixture is decomposed into elements. An element may be regarded as an atomic part of the auditory scene, which …
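The "computational model of auditory processing" referred to above is, in most CASA systems, a gammatone filterbank followed by a crude model of hair-cell transduction, producing a time-frequency "cochleagram". The sketch below illustrates that standard front end under assumed (not paper-specified) choices: 64 channels spaced on the ERB-rate scale, fourth-order gammatone filters, and half-wave rectification as the hair-cell stage. It uses NumPy and SciPy.

import numpy as np
from scipy.signal import fftconvolve

def erb(f):
    # Equivalent rectangular bandwidth (Glasberg & Moore) in Hz.
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.025, order=4):
    # Impulse response of a gammatone filter centred at fc Hz.
    t = np.arange(0.0, duration, 1.0 / fs)
    b = 1.019 * erb(fc)  # bandwidth scaling from Patterson et al.
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def cochleagram(x, fs, n_channels=64, fmin=50.0, fmax=8000.0):
    # Simulated auditory-nerve firing pattern: gammatone filterbank
    # followed by half-wave rectification (a minimal hair-cell model).
    # Centre frequencies are equally spaced on the ERB-rate scale.
    erb_lo = 21.4 * np.log10(4.37 * fmin / 1000.0 + 1.0)
    erb_hi = 21.4 * np.log10(4.37 * fmax / 1000.0 + 1.0)
    fcs = (10.0 ** (np.linspace(erb_lo, erb_hi, n_channels) / 21.4) - 1.0) / 4.37 * 1000.0
    out = np.empty((n_channels, len(x)))
    for i, fc in enumerate(fcs):
        band = fftconvolve(x, gammatone_ir(fc, fs), mode="full")[:len(x)]
        out[i] = np.maximum(band, 0.0)  # half-wave rectify
    return fcs, out

A representation like this is the usual input to the element-forming stage of a CASA system: periodicity, onset, and modulation features are computed per channel and then grouped across frequency and time.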
A vector quantizer (VQ) trained on short-time frames of a particular source can form an accurate non-parametric model of that source. This principle has been used in several previous source separation and enhancement schemes as a basis for filtering the original mixture. In this paper, we propose the "projection" of a corrupted target signal onto the constrained space represented by the model as a viable approach to source separation. We investigate some parameters of VQ encoding, including a more perceptually motivated distance measure, and an encoding of phase derivatives that supports reconstruction directly from quantizer output alone. For the problem of separating speech from noise, we highlight some problems with this approach, including the need for sequential constraints (which we introduce with a simple hidden Markov model), and strategies for choosing the best quantization for overlapping sources.
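As a rough illustration of the "projection" idea, the sketch below trains a codebook on short-time frames of the clean target source and then replaces each frame of the corrupted signal with its nearest codeword. The frame features (e.g. magnitude-spectrogram columns), the codebook size, and the plain Euclidean distance are illustrative assumptions; the paper itself also investigates a more perceptually motivated distance and a phase-derivative encoding, which this sketch omits. It uses NumPy and scikit-learn.

import numpy as np
from sklearn.cluster import KMeans

def train_codebook(clean_frames, n_codewords=512):
    # Learn a VQ codebook from frames of the clean target source.
    # clean_frames: array of shape (n_frames, n_features).
    km = KMeans(n_clusters=n_codewords, n_init=4, random_state=0)
    km.fit(clean_frames)
    return km.cluster_centers_

def project(mixture_frames, codebook):
    # Replace each frame of the mixture with its nearest codeword:
    # the "projection" onto the constrained space the model represents.
    dists = ((mixture_frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook[dists.argmin(axis=1)]

Independent per-frame decisions like these ignore temporal continuity, which is the problem the sequential constraint in the abstract addresses: treating the codeword indices as hidden states of a simple HMM and decoding with Viterbi penalises implausible frame-to-frame jumps through the codebook.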