Video Face Clustering With Unknown Number of Clusters

Tapaswi, Makarand; Law, Marc T.; Fidler, Sanja

doi:10.1109/iccv.2019.00513

Cited by 46 publications

(43 citation statements)

References 49 publications

(90 reference statements)

“…Thus, they can generalize well on the unseen videos. Based on the pretrained models, many interesting downstream tasks have been explored, such as deep clustering [30,33], face clustering [34,61], person search [67], person clustering [4], as well as speaker diarization [10,41]. The above verification models are all uni-modal.…”

Section: Related Work (mentioning)

confidence: 99%

“…Backbone. Pretrained on large-scale datasets, speaker verification, and face recognition models have strong generalization ability and are directly applied in various downstream tasks [4,10,30,33,34,41,61,67]. We follow these works and utilize the backbones of off-the-shelf models [8,17] to encode voice and face features respectively.…”

Section: Audio-visual Relation Network (mentioning)

confidence: 99%

“…Input features. Recent face clustering [61], multi-modal clustering [30,33], and verification [49] methods learn the threshold or unified representation based on the embedding vectors extracted by pretrained models. Hence, we also study the influence of inputs for our AVR-Net.…”

Section: Comparisons With State-of-the-art (mentioning)

confidence: 99%

“…During inference, we use a common sliding window approach to process the input sequence segment by segment for videos of arbitrary length. We randomly select one image from the face tracks because, unlike averaging embedding vectors [61], averaging feature maps is harmful to our AVR-Net and thus does not bring any performance gain in practice. Moreover, cannot-link and must-link constraints are widely used by face clustering algorithms [4,34,61].…”

Section: A Appendix (mentioning)

confidence: 99%

“…We randomly select one image from the face tracks because, unlike averaging embedding vectors [61], averaging feature maps is harmful to our AVR-Net and thus does not bring any performance gain in practice. Moreover, cannot-link and must-link constraints are widely used by face clustering algorithms [4,34,61]. However, this technique does not work in our system and is ignored consequently.…”

Section: A Appendix (mentioning)

confidence: 99%

AVA-AVD: Audio-Visual Speaker Diarization in the Wild

Xu, Song, Tsutsui et al. 2021

Preprint

Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals. Existing audio-visual diarization datasets focus mainly on indoor environments such as meeting rooms or news studios, which differ considerably from in-the-wild videos such as movies, documentaries, and audience sitcoms. To create a testbed that can effectively compare diarization methods on videos in the wild, we annotate the speaker diarization labels on the AVA movie dataset and create a new benchmark called AVA-AVD. This benchmark is challenging due to the diverse scenes, complicated acoustic conditions, and completely off-screen speakers. Yet how to deal with off-screen and on-screen speakers together remains a critical challenge. To overcome it, we propose a novel Audio-Visual Relation Network (AVR-Net), which introduces an effective modality mask to capture discriminative information based on visibility. Experiments show that our method not only outperforms state-of-the-art methods but is also more robust as the ratio of off-screen speakers varies. Ablation studies demonstrate the advantages of the proposed AVR-Net, and especially of the modality mask, for diarization. Our data and code will be made publicly available at https://github.com/zcxueric/AVA-AVD.
