“…Localizing and tracking speakers in enclosed spaces using audio-visual (AV) information has attracted increasing attention in signal processing and computer vision [36,17,7,34,13,43,48,1,3,6,5], given the complementary characteristics of the two modalities. Broadly speaking, existing works differ in their overall goal (tracking single vs. multiple speakers), the specific detection/tracking framework, and the AV sensor configuration.…”