2017 IEEE International Conference on Computer Vision Workshops (ICCVW)
DOI: 10.1109/iccvw.2017.60
Exploiting the Complementarity of Audio and Visual Data in Multi-speaker Tracking

Abstract: Multi-speaker tracking is a central problem in human-robot interaction. In this context, exploiting auditory and visual information is gratifying and challenging at the same time. Gratifying because the complementary nature of auditory and visual information allows us to be more robust against noise and outliers than unimodal approaches. Challenging because how to properly fuse auditory and visual information for multi-speaker tracking is far from being a solved problem. In this paper we propose a probabilistic…

Cited by 22 publications (26 citation statements)
References 19 publications
“…Similar to [7], [16] uses one CGMM for each predefined speaker; the model is plugged into a recursive EM (REM) algorithm in order to update the model. Speaker tracking methods are generally based on Bayesian inference, which combines localization with dynamic models in order to estimate the posterior probability distribution of audio-source directions, e.g. [17]–[19]. Kalman filtering and particle filtering were used in [20] and [21], respectively, for tracking a single audio source.…”
Section: Introduction (mentioning)
confidence: 99%
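The Kalman-filter approach mentioned above for tracking a single audio source can be sketched minimally. The constant-velocity state model, the noise levels, and the function name below are illustrative assumptions, not the implementation of any cited paper.

```python
import numpy as np

def kalman_track_doa(observations, q=0.01, r=0.5):
    """Track a single source direction (DOA angle, in degrees) with a
    constant-velocity Kalman filter. Hypothetical sketch: state is
    (angle, angular rate); only the angle is observed."""
    dt = 1.0
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition
    H = np.array([[1.0, 0.0]])              # observe the angle only
    Q = q * np.eye(2)                       # process-noise covariance
    R = np.array([[r]])                     # observation-noise covariance
    x = np.array([observations[0], 0.0])    # initial state estimate
    P = np.eye(2)                           # initial state covariance
    track = []
    for z in observations:
        # predict step: propagate state and covariance
        x = F @ x
        P = F @ P @ F.T + Q
        # update step: correct with the new observation
        y = z - H @ x                       # innovation
        S = H @ P @ H.T + R                 # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
        x = x + (K @ y).ravel()
        P = (np.eye(2) - K @ H) @ P
        track.append(x[0])
    return np.array(track)
```

With noise-free observations of a static source, the filtered angle stays locked on the observed value; with noisy observations, the estimate smooths them according to the ratio of `q` to `r`.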
“…This is probably one of the most prominent features of the method, in contrast with most existing audio-visual tracking methods which require continuous and simultaneous flows of visual and audio data. This paper is an extended version of [25] and of [26]. The probabilistic model and its variational approximation were briefly presented in [25] together with preliminary results obtained with three AVDIAR sequences.…”
Section: Related Work (mentioning)
confidence: 99%
“…This paper is an extended version of [25] and of [26]. The probabilistic model and its variational approximation were briefly presented in [25] together with preliminary results obtained with three AVDIAR sequences. Reverberation-free audio features were used in [26] where it was shown that good performance could be obtained with these features when the audio mapping was trained in one room and tested in another room.…”
Section: Related Work (mentioning)
confidence: 99%
“…To localize moving speakers, a tracking scheme based on Bayesian techniques estimates the posterior distribution of source locations given a sequence of instantaneous estimates of localization features (or of speaker locations) and a dynamic model of source movement, e.g. [12]- [14]. To tackle speech turns, speaker birth and death processes [15] and/or a model of speech activity [16] can be included.…”
Section: Introduction (mentioning)
confidence: 99%
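The Bayesian recursion described in the statement above (propagate a dynamic model of source movement, weight by an observation likelihood, form the posterior) can be illustrated with a minimal bootstrap particle filter. The random-walk dynamics, Gaussian likelihood, and all names are hypothetical, and the sketch omits the birth/death processes of [15] and the speech-activity model of [16].

```python
import numpy as np

def particle_filter_doa(observations, n_particles=500, sigma_dyn=2.0,
                        sigma_obs=5.0, seed=0):
    """Bootstrap particle filter for a single source direction (degrees).
    Minimal sketch of the generic Bayesian tracking recursion; dynamic
    and observation models are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    # uniform prior over the half-plane of directions
    particles = rng.uniform(0.0, 180.0, n_particles)
    estimates = []
    for z in observations:
        # propagate particles with a random-walk dynamic model
        particles = particles + rng.normal(0.0, sigma_dyn, n_particles)
        # weight by a Gaussian observation likelihood around the measurement
        w = np.exp(-0.5 * ((z - particles) / sigma_obs) ** 2)
        w /= w.sum()
        # posterior-mean estimate, then resample to avoid degeneracy
        estimates.append(float(np.sum(w * particles)))
        idx = rng.choice(n_particles, n_particles, p=w)
        particles = particles[idx]
    return estimates
```

Feeding the filter a sequence of instantaneous localization estimates yields a smoothed posterior-mean track; the particle cloud concentrates around the observed direction after a few iterations.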