Abstract: We propose an audiovisual fusion algorithm for 3D speaker tracking from a localised multi-modal sensor platform composed of a camera and a small microphone array. After extracting audiovisual cues from the individual modalities, we fuse them adaptively using their reliability in a particle filter framework. The reliability of the audio signal is measured based on the maximum Global Coherence Field (GCF) peak value at each frame. The visual reliability is based on colour-histogram matching with detection results com…
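The abstract above describes reliability-adaptive fusion inside a particle filter, with the maximum GCF peak as the audio reliability cue and colour-histogram matching as the visual one. The following Python sketch is a minimal illustration of how such cues could modulate the per-particle likelihood fusion; the function name, the [0, 1] reliability mapping, and the log-linear fusion rule are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def fuse_particle_weights(audio_llh, visual_llh, gcf_peak, hist_score):
    """Reliability-adaptive fusion of per-particle likelihoods (illustrative sketch)."""
    # Map the raw reliability cues to [0, 1]; this mapping is assumed.
    r_audio = float(np.clip(gcf_peak, 0.0, 1.0))
    r_visual = float(np.clip(hist_score, 0.0, 1.0))
    gamma = r_audio / (r_audio + r_visual + 1e-12)  # relative audio reliability

    # Log-linear fusion: the more reliable stream dominates the particle weights.
    log_w = gamma * np.log(audio_llh + 1e-300) \
          + (1.0 - gamma) * np.log(visual_llh + 1e-300)
    w = np.exp(log_w - log_w.max())   # subtract max for numerical stability
    return w / w.sum()                # normalised particle weights
```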
“…This corresponds to a small process noise, which cannot be handled efficiently using PFs. However, the PF baseline method from [32] shows a performance comparable to the EKF-based methods, which indicates that the MDF approach used in this algorithm is efficient on this dataset.…”
Section: B. Audiovisual Tracking Performance Analysis (mentioning)
confidence: 91%
“…A comparison of the Bayesian filtering framework proposed in this study with state-of-the-art audiovisual speaker tracking methods is the primary focus of the second evaluation scenario. Four different frameworks were selected as baseline methods: the standard EKF with audiovisual observations, the audiovisual fusion technique based on an iterated EKF as proposed by Gehring et al. [30], the PF-based approach with adaptive particle weighting introduced by Gerlach et al. [31], and the recently proposed framework by Qian et al. [32], which explicitly incorporates sensor reliability measures into the weighting stage of the PF. These methods are compared with the ODSW-EKF with Dirichlet prior and a DSW-EKF with corresponding prediction model based on the logistic function, as introduced in Sec.…”
Section: B. Audiovisual Tracking Performance Analysis (mentioning)
confidence: 99%
“…The framework provided explicit control over the individual contributions of acoustic and visual observations via exponential weighting parameters, which were determined a priori using a grid search. A recently proposed algorithm for speaker tracking has explicitly considered sensor reliability measures within a particle filtering framework [32]. This work utilized the peak value of the acoustic global coherence field and the correlation between a color-histogram template and the detected face as features, which affected the weighting and resampling steps of the particle filter.…”
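The exponential weighting mentioned in this snippet has a compact standard form. As a sketch in generic notation (the symbols below are placeholders, not necessarily those of the cited papers), the fused observation likelihood reads

\[
p(\mathbf{z}_t \mid \mathbf{x}_t) \;\propto\; p_a(\mathbf{z}_{a,t} \mid \mathbf{x}_t)^{\gamma_a}\, p_v(\mathbf{z}_{v,t} \mid \mathbf{x}_t)^{\gamma_v}, \qquad \gamma_a, \gamma_v \ge 0,
\]

where the exponents are either fixed a priori by grid search, as in the first framework, or driven at each frame by instantaneous reliability measures, as in [32].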
Data fusion plays an important role in many technical applications that require efficient processing of multimodal sensory observations. A prominent example is audiovisual signal processing, which has gained increasing attention in automatic speech recognition, speaker localization and related tasks. When appropriately combined with acoustic information, additional visual cues can help to improve performance in these applications, especially under adverse acoustic conditions. A dynamic weighting of acoustic and visual streams based on instantaneous sensor reliability measures is an efficient approach to data fusion in this context. This paper presents a framework that extends the well-established theory of nonlinear dynamical systems with the notion of dynamic stream weights for an arbitrary number of sensory observations. It comprises a recursive state estimator based on the Gaussian filtering paradigm, which incorporates dynamic stream weights into a framework closely related to the extended Kalman filter. Additionally, a convex optimization approach to estimating oracle dynamic stream weights in fully observed dynamical systems utilizing a Dirichlet prior is presented. This serves as a basis for a generic parameter learning framework for dynamic stream weight estimators. The proposed system is application-independent and can easily be adapted to specific tasks and requirements. Audiovisual speaker tracking serves as an exemplary application in this work. The experiments demonstrate an improved tracking performance of the dynamic stream weight-based estimation framework over state-of-the-art methods.
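To make the notion of dynamic stream weights concrete, a common formulation (given here in generic notation as a sketch; the paper's exact model may differ) combines the per-stream log-likelihoods convexly on the probability simplex,

\[
\log p(\mathbf{y}_t \mid \mathbf{x}_t) \;\approx\; \sum_{i=1}^{M} \lambda_{i,t} \log p(\mathbf{y}_{i,t} \mid \mathbf{x}_t), \qquad \lambda_{i,t} \ge 0, \quad \sum_{i=1}^{M} \lambda_{i,t} = 1,
\]

and estimates oracle weights by maximizing this objective plus a Dirichlet log-prior \(\sum_i (\alpha_i - 1) \log \lambda_{i,t}\) over the simplex; for \(\alpha_i \ge 1\) the objective is concave in the weights, so the estimation is a convex optimization problem, consistent with the abstract's description.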
“…More recently, audio-visual trackers based on particle filtering (PF) and probability hypothesis density (PHD) filters were proposed, e.g. [4]–[7], [20]–[22]. In [6], DOAs of audio sources were used to guide the propagation of particles, and the filter was combined with a mean-shift algorithm to reduce the computational complexity.…”
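As a generic illustration of DOA-guided particle propagation (a minimal sketch under assumed drift and noise parameters, not the implementation of [6]):

```python
import numpy as np

def propagate_particles(particles, doa_unit_vec, source_range=2.0,
                        drift=0.3, noise_std=0.05, rng=None):
    """Drift 3D particles toward the point implied by an audio DOA estimate.

    particles    : (N, 3) array of particle positions.
    doa_unit_vec : (3,) unit vector toward the source, seen from the array.
    source_range : assumed source distance, since a DOA carries no range.
    """
    rng = rng or np.random.default_rng()
    target = source_range * np.asarray(doa_unit_vec)  # point on the DOA ray
    # Deterministic drift toward the DOA target plus diffusion noise,
    # playing the role of the particle filter's propagation step.
    particles = particles + drift * (target - particles)
    return particles + rng.normal(scale=noise_std, size=particles.shape)
```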
In this paper we address the problem of tracking multiple speakers via the fusion of visual and auditory information. We propose to exploit the complementary nature of these two modalities in order to accurately estimate smooth trajectories of the tracked persons, to deal with the partial or total absence of one of the modalities over short periods of time, and to estimate the acoustic status, either speaking or silent, of each tracked person over time. We propose to cast the problem at hand into a generative audio-visual fusion (or association) model formulated as a latent-variable temporal graphical model. This may well be viewed as the problem of maximizing the posterior joint distribution of a set of continuous and discrete latent variables given the past and current observations, which is intractable. We therefore propose a variational inference model, which amounts to approximating the joint distribution with a factorized distribution. The solution takes the form of a closed-form expectation-maximization procedure. We describe the inference algorithm in detail, evaluate its performance and compare it with several baseline methods. These experiments show that the proposed audio-visual tracker performs well in informal meetings involving a time-varying number of people.
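The variational approximation sketched in this abstract follows the standard mean-field pattern. Schematically, in generic notation (not necessarily the paper's), the intractable joint posterior over the continuous states \(\mathbf{x}_t\) (speaker positions) and discrete variables \(z_t\) (e.g., observation-to-speaker assignments and speaking status) is replaced by a factorized distribution,

\[
p(\mathbf{x}_t, z_t \mid \mathbf{o}_{1:t}) \;\approx\; q(\mathbf{x}_t)\, q(z_t),
\]

and the two factors are updated alternately in closed form, yielding the expectation-maximization procedure the authors mention.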
“…Alternatively, [7] used a Markov chain Monte Carlo particle filter (MCMC-PF) to increase sampling efficiency. Still in a particle filter tracking framework, [8] proposed to use the maximum global coherence field of the audio signal and image color-histogram matching to adapt the reliability of audio and visual information. Finally, along a different line, [9] used visual tracking information to assist source separation and beamforming.…”
Multi-speaker tracking is a central problem in human-robot interaction. In this context, exploiting auditory and visual information is gratifying and challenging at the same time. Gratifying because the complementary nature of auditory and visual information allows us to be more robust against noise and outliers than unimodal approaches. Challenging because how to properly fuse auditory and visual information for multi-speaker tracking is far from being a solved problem. In this paper we propose a probabilistic generative model that tracks multiple speakers by jointly exploiting auditory and visual features in their own representation spaces. Importantly, the method is robust to missing data and is therefore able to track even when observations from one of the modalities are absent. Quantitative and qualitative results on the AVDIAR dataset are reported.