“…The problem of audiovisual tracking involves the estimation of the arrival angle of the audio signal, video detection, filtering and smoothing of the two modalities, fusion, and finally joint state estimation. Let the target state be defined as y(t) = (x, y, w, h, H), where (x, y) is the center of the ellipse approximating the object shape, (w, h) are the width and height of the bounding box and H is the color histogram of the object.…”

Table fragment interleaved with the excerpt (column headers not recoverable):
[36] | Camera and 2 microphones | TDNN | Surveillance
[1], [11], [15], [32] | PF | Surveillance and teleconferencing
[12], [13], [37] | KF, DKF | Smart rooms
[38] | Multiple cameras and microphone arrays | LDA | Smart rooms
[9], [30], [31], [39] | PF | Meeting rooms
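
As an illustration of the state tuple (x, y, w, h, H) described in the excerpt, the following is a minimal sketch of how such a state might be held in code. The names `TargetState` and `color_histogram`, the RGB color space, and the choice of 4 bins per channel are assumptions made for this example; the excerpt does not say how the histogram H is computed or binned.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class TargetState:
    """Hypothetical container for the state (x, y, w, h, H)."""
    x: float          # horizontal coordinate of the ellipse center
    y: float          # vertical coordinate of the ellipse center
    w: float          # bounding-box width
    h: float          # bounding-box height
    H: List[float]    # normalized color histogram of the object


def color_histogram(pixels: Sequence[Tuple[int, int, int]],
                    bins_per_channel: int = 4) -> List[float]:
    """Build a normalized joint RGB histogram (an assumed binning scheme).

    Each 8-bit channel value c is mapped to bin c * bins // 256, and the
    three bin indices are combined into one flat index.
    """
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    for r, g, b in pixels:
        idx = ((r * n // 256) * n + (g * n // 256)) * n + (b * n // 256)
        hist[idx] += 1.0
    total = sum(hist)
    # Normalize so the bins sum to 1; an empty region stays all zeros.
    return [v / total for v in hist] if total else hist


# Usage: three red pixels and one blue pixel inside the object region.
pixels = [(255, 0, 0)] * 3 + [(0, 0, 255)]
H = color_histogram(pixels)
state = TargetState(x=10.0, y=20.0, w=4.0, h=8.0, H=H)
```

Normalizing H makes histograms from regions of different sizes comparable, which is a common design choice when a color model is matched across frames; whether the original work normalizes in this way is not stated in the excerpt.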