Abstract: We propose an audiovisual fusion algorithm for 3D speaker tracking from a localised multi-modal sensor platform composed of a camera and a small microphone array. After extracting audiovisual cues from the individual modalities, we fuse them adaptively using their reliability in a particle filter framework. The reliability of the audio signal is measured based on the maximum Global Coherence Field (GCF) peak value at each frame. The visual reliability is based on colour-histogram matching with detection results com…
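The abstract above describes reliability-adaptive fusion inside a particle filter, with the maximum GCF peak as the audio reliability cue and colour-histogram matching as the visual one. The following Python sketch is a minimal illustration of how such cues could modulate the per-particle likelihood fusion; the function name, the [0, 1] reliability mapping, and the log-linear fusion rule are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def fuse_particle_weights(audio_llh, visual_llh, gcf_peak, hist_score):
    """Reliability-adaptive fusion of per-particle likelihoods (illustrative sketch)."""
    # Map the raw reliability cues to [0, 1]; this mapping is assumed.
    r_audio = float(np.clip(gcf_peak, 0.0, 1.0))
    r_visual = float(np.clip(hist_score, 0.0, 1.0))
    gamma = r_audio / (r_audio + r_visual + 1e-12)  # relative audio reliability

    # Log-linear fusion: the more reliable stream dominates the particle weights.
    log_w = gamma * np.log(audio_llh + 1e-300) \
          + (1.0 - gamma) * np.log(visual_llh + 1e-300)
    w = np.exp(log_w - log_w.max())   # subtract max for numerical stability
    return w / w.sum()                # normalised particle weights
```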
“…This corresponds to a small process noise, which cannot be handled efficiently using PFs. However, the PF baseline method from [32] shows a performance comparable to the EKF-based methods, which indicates that the MDF approach used in this algorithm is efficient on this dataset.…”
Section: B. Audiovisual Tracking Performance Analysis (mentioning)
confidence: 91%
“…A comparison of the Bayesian filtering framework proposed in this study with state-of-the-art audiovisual speaker tracking methods is the primary focus of the second evaluation scenario. Four different frameworks were selected as baseline methods: the standard EKF with audiovisual observations, the audiovisual fusion technique based on an iterated EKF as proposed by Gehring et al. [30], the PF-based approach with adaptive particle weighting introduced by Gerlach et al. [31], and the recently proposed framework by Qian et al. [32], which explicitly incorporates sensor reliability measures into the weighting stage of the PF. These methods are compared with the ODSW-EKF with Dirichlet prior and a DSW-EKF with corresponding prediction model based on the logistic function, as introduced in Sec.…”
Section: B. Audiovisual Tracking Performance Analysis (mentioning)
confidence: 99%
“…The framework provided explicit control over the individual contributions of acoustic and visual observations via exponential weighting parameters, which were determined a priori using a grid search. A recently proposed algorithm for speaker tracking has explicitly considered sensor reliability measures within a particle filtering framework [32]. This work utilized the peak value of the acoustic global coherence field and the correlation between a color-histogram template and the detected face as features, which affected the weighting and resampling steps of the particle filter.…”
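The exponential weighting mentioned in this snippet has a compact standard form. As a sketch in generic notation (the symbols below are placeholders, not necessarily those of the cited papers), the fused observation likelihood reads

\[
p(\mathbf{z}_t \mid \mathbf{x}_t) \;\propto\; p_a(\mathbf{z}_{a,t} \mid \mathbf{x}_t)^{\gamma_a}\, p_v(\mathbf{z}_{v,t} \mid \mathbf{x}_t)^{\gamma_v}, \qquad \gamma_a, \gamma_v \ge 0,
\]

where the exponents are either fixed a priori by grid search, as in the first framework, or driven at each frame by instantaneous reliability measures, as in [32].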
Data fusion plays an important role in many technical applications that require efficient processing of multimodal sensory observations. A prominent example is audiovisual signal processing, which has gained increasing attention in automatic speech recognition, speaker localization and related tasks. When appropriately combined with acoustic information, additional visual cues can help to improve performance in these applications, especially under adverse acoustic conditions. A dynamic weighting of acoustic and visual streams based on instantaneous sensor reliability measures is an efficient approach to data fusion in this context. This paper presents a framework that extends the well-established theory of nonlinear dynamical systems with the notion of dynamic stream weights for an arbitrary number of sensory observations. It comprises a recursive state estimator based on the Gaussian filtering paradigm, which incorporates dynamic stream weights into a framework closely related to the extended Kalman filter. Additionally, a convex optimization approach to estimating oracle dynamic stream weights in fully observed dynamical systems utilizing a Dirichlet prior is presented. This serves as a basis for a generic parameter learning framework for dynamic stream weight estimators. The proposed system is application-independent and can easily be adapted to specific tasks and requirements. Audiovisual speaker tracking serves as an exemplary application in this work. The experiments demonstrate an improved tracking performance of the dynamic stream weight-based estimation framework over state-of-the-art methods.
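To make the notion of dynamic stream weights concrete, a common formulation (given here in generic notation as a sketch; the paper's exact model may differ) combines the per-stream log-likelihoods convexly on the probability simplex,

\[
\log p(\mathbf{y}_t \mid \mathbf{x}_t) \;\approx\; \sum_{i=1}^{M} \lambda_{i,t} \log p(\mathbf{y}_{i,t} \mid \mathbf{x}_t), \qquad \lambda_{i,t} \ge 0, \quad \sum_{i=1}^{M} \lambda_{i,t} = 1,
\]

and estimates oracle weights by maximizing this objective plus a Dirichlet log-prior \(\sum_i (\alpha_i - 1) \log \lambda_{i,t}\) over the simplex; for \(\alpha_i \ge 1\) the objective is concave in the weights, so the estimation is a convex optimization problem, consistent with the abstract's description.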
“…More recently, audio-visual trackers based on particle filtering (PF) and probability hypothesis density (PHD) filters were proposed, e.g. [4]–[7], [20]–[22]. In [6], DOAs of audio sources were used to guide the propagation of particles, and the filter was combined with a mean-shift algorithm to reduce the computational complexity.…”
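As a generic illustration of DOA-guided particle propagation (a minimal sketch under assumed drift and noise parameters, not the implementation of [6]):

```python
import numpy as np

def propagate_particles(particles, doa_unit_vec, source_range=2.0,
                        drift=0.3, noise_std=0.05, rng=None):
    """Drift 3D particles toward the point implied by an audio DOA estimate.

    particles    : (N, 3) array of particle positions.
    doa_unit_vec : (3,) unit vector toward the source, seen from the array.
    source_range : assumed source distance, since a DOA carries no range.
    """
    rng = rng or np.random.default_rng()
    target = source_range * np.asarray(doa_unit_vec)  # point on the DOA ray
    # Deterministic drift toward the DOA target plus diffusion noise,
    # playing the role of the particle filter's propagation step.
    particles = particles + drift * (target - particles)
    return particles + rng.normal(scale=noise_std, size=particles.shape)
```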
In this paper we address the problem of tracking multiple speakers via the fusion of visual and auditory information. We propose to exploit the complementary nature of these two modalities in order to accurately estimate smooth trajectories of the tracked persons, to deal with the partial or total absence of one of the modalities over short periods of time, and to estimate the acoustic status, either speaking or silent, of each tracked person over time. We propose to cast the problem at hand into a generative audio-visual fusion (or association) model formulated as a latent-variable temporal graphical model. This may well be viewed as the problem of maximizing the posterior joint distribution of a set of continuous and discrete latent variables given the past and current observations, which is intractable. We therefore propose a variational inference model, which amounts to approximating the joint distribution with a factorized distribution. The solution takes the form of a closed-form expectation-maximization procedure. We describe the inference algorithm in detail, evaluate its performance and compare it with several baseline methods. These experiments show that the proposed audio-visual tracker performs well in informal meetings involving a time-varying number of people.
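The variational approximation sketched in this abstract follows the standard mean-field pattern. Schematically, in generic notation (not necessarily the paper's), the intractable joint posterior over the continuous states \(\mathbf{x}_t\) (speaker positions) and discrete variables \(z_t\) (e.g., observation-to-speaker assignments and speaking status) is replaced by a factorized distribution,

\[
p(\mathbf{x}_t, z_t \mid \mathbf{o}_{1:t}) \;\approx\; q(\mathbf{x}_t)\, q(z_t),
\]

and the two factors are updated alternately in closed form, yielding the expectation-maximization procedure the authors mention.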
“…Alternatively, [7] used a Markov chain Monte Carlo particle filter (MCMC-PF) to increase sampling efficiency. Still in a particle filter tracking framework, [8] proposed to use the maximum global coherence field of the audio signal and image color-histogram matching to adapt the reliability of audio and visual information. Finally, along a different line, [9] used visual tracking information to assist source separation and beamforming.…”
Multi-speaker tracking is a central problem in human-robot interaction. In this context, exploiting auditory and visual information is gratifying and challenging at the same time. Gratifying because the complementary nature of auditory and visual information allows us to be more robust against noise and outliers than unimodal approaches. Challenging because how to properly fuse auditory and visual information for multi-speaker tracking is far from being a solved problem. In this paper we propose a probabilistic generative model that tracks multiple speakers by jointly exploiting auditory and visual features in their own representation spaces. Importantly, the method is robust to missing data and is therefore able to track even when observations from one of the modalities are absent. Quantitative and qualitative results on the AVDIAR dataset are reported.