Abstract. The term auditory scene analysis (ASA) refers to the ability of human listeners to form perceptual representations of the constituent sources in an acoustic mixture, as in the well-known 'cocktail party' effect. Accordingly, computational auditory scene analysis (CASA) is the field of study which attempts to replicate ASA in machines. Some CASA systems are closely modelled on the known stages of auditory processing, whereas others adopt a more functional approach. However, all are broadly based on the principles underlying the perception and organisation of sound by human listeners, and in this respect they differ from ICA and other approaches to sound separation. In this paper, we review the principles underlying ASA and show how they can be implemented in CASA systems. We also consider the link between CASA and automatic speech recognition, and draw distinctions between the CASA and ICA approaches.

Introduction

Imagine a recording of a busy party, in which you can hear voices, music and other environmental sounds. How might a computational system process this recording in order to segregate the voice of a particular speaker from the other sources? Independent component analysis (ICA) offers one solution to this problem. However, it is not a solution that has much in common with that adopted by the best-performing sound separation system we know of: the human auditory system. Perhaps the key to building a sound separator that rivals human performance is to model human perceptual processing? This argument provides the motivation for the field of computational auditory scene analysis (CASA), which aims to build sound separation systems that adhere to the known principles of human hearing. In this chapter, we review the state of the art in CASA, and consider its similarities and differences with the ICA approach. We also consider the relationship between CASA and techniques for robust automatic speech recognition in noisy environments, and comment on the challenges facing this growing field of study.

Auditory Scene Analysis

In naturalistic listening situations, several sound sources are usually active at the same time, and the pressure variations in air that they generate combine to form a mixture at the ears of the listener. A common example is the situation in which the voices of two talkers overlap, as illustrated in Figure 16.1C. The figure shows the simulated auditory nerve response to a mixture of a male and a female voice, obtained from a computational model of auditory processing. How can this complex acoustic mixture be parsed in order to retrieve a description of one (or both) of the constituent sources? Bregman [5] was the first to present a coherent answer to this question (see also [17] for a more recent review). He contends that listeners perform an auditory scene analysis (ASA), which can be conceptualised as a two-stage process. In the first stage, the acoustic mixture is decomposed into elements. An element may be regarded as an atomic part of the auditory scene, which …
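The "computational model of auditory processing" referred to above is, in most CASA systems, a gammatone filterbank followed by a crude model of hair-cell transduction, producing a time-frequency "cochleagram". The sketch below illustrates that standard front end under assumed (not paper-specified) choices: 64 channels spaced on the ERB-rate scale, fourth-order gammatone filters, and half-wave rectification as the hair-cell stage. It uses NumPy and SciPy.

import numpy as np
from scipy.signal import fftconvolve

def erb(f):
    # Equivalent rectangular bandwidth (Glasberg & Moore) in Hz.
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.025, order=4):
    # Impulse response of a gammatone filter centred at fc Hz.
    t = np.arange(0.0, duration, 1.0 / fs)
    b = 1.019 * erb(fc)  # bandwidth scaling from Patterson et al.
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def cochleagram(x, fs, n_channels=64, fmin=50.0, fmax=8000.0):
    # Simulated auditory-nerve firing pattern: gammatone filterbank
    # followed by half-wave rectification (a minimal hair-cell model).
    # Centre frequencies are equally spaced on the ERB-rate scale.
    erb_lo = 21.4 * np.log10(4.37 * fmin / 1000.0 + 1.0)
    erb_hi = 21.4 * np.log10(4.37 * fmax / 1000.0 + 1.0)
    fcs = (10.0 ** (np.linspace(erb_lo, erb_hi, n_channels) / 21.4) - 1.0) / 4.37 * 1000.0
    out = np.empty((n_channels, len(x)))
    for i, fc in enumerate(fcs):
        band = fftconvolve(x, gammatone_ir(fc, fs), mode="full")[:len(x)]
        out[i] = np.maximum(band, 0.0)  # half-wave rectify
    return fcs, out

A representation like this is the usual input to the element-forming stage of a CASA system: periodicity, onset, and modulation features are computed per channel and then grouped across frequency and time.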
A vector quantizer (VQ) trained on short-time frames of a particular source can form an accurate non-parametric model of that source. This principle has been used in several previous source separation and enhancement schemes as a basis for filtering the original mixture. In this paper, we propose the "projection" of a corrupted target signal onto the constrained space represented by the model as a viable approach to source separation. We investigate some parameters of VQ encoding, including a more perceptually motivated distance measure, and an encoding of phase derivatives that supports reconstruction directly from quantizer output alone. For the problem of separating speech from noise, we highlight some problems with this approach, including the need for sequential constraints (which we introduce with a simple hidden Markov model), and strategies for choosing the best quantization for overlapping sources.
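As a rough illustration of the "projection" idea, the sketch below trains a codebook on short-time frames of the clean target source and then replaces each frame of the corrupted signal with its nearest codeword. The frame features (e.g. magnitude-spectrogram columns), the codebook size, and the plain Euclidean distance are illustrative assumptions; the paper itself also investigates a more perceptually motivated distance and a phase-derivative encoding, which this sketch omits. It uses NumPy and scikit-learn.

import numpy as np
from sklearn.cluster import KMeans

def train_codebook(clean_frames, n_codewords=512):
    # Learn a VQ codebook from frames of the clean target source.
    # clean_frames: array of shape (n_frames, n_features).
    km = KMeans(n_clusters=n_codewords, n_init=4, random_state=0)
    km.fit(clean_frames)
    return km.cluster_centers_

def project(mixture_frames, codebook):
    # Replace each frame of the mixture with its nearest codeword:
    # the "projection" onto the constrained space the model represents.
    dists = ((mixture_frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook[dists.argmin(axis=1)]

Independent per-frame decisions like these ignore temporal continuity, which is the problem the sequential constraint in the abstract addresses: treating the codeword indices as hidden states of a simple HMM and decoding with Viterbi penalises implausible frame-to-frame jumps through the codebook.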