Analyzing the story behind TV series and movies often requires understanding who the characters are and what they are doing. With improving deep face models, this may seem like a solved problem. However, as face detectors get better, clustering/identification needs to be revisited to address the increasing diversity in facial appearance. In this paper, we address video face clustering using unsupervised methods. Our emphasis is on distilling the essential information, identity, from the representations obtained using deep pre-trained face networks. We propose a self-supervised Siamese network that can be trained without the need for video/track-based supervision, and thus can also be applied to image collections. We evaluate our proposed method on three video face clustering datasets. The experiments show that our method outperforms current state-of-the-art methods on all datasets. Video face clustering lacks a common benchmark, as current works are often evaluated with different metrics and/or different sets of face tracks. The datasets and code are available at https://github.com/vivoutlaw/SSIAM.
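To make the idea concrete, here is a minimal sketch (not the authors' released code) of training a Siamese embedding on top of frozen pre-trained face descriptors, with positive/negative pairs mined automatically from feature-space distances. The projection size, the nearest/farthest mining heuristic, and the contrastive margin are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: self-supervised Siamese training on pre-computed face features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEmbedding(nn.Module):
    """Shallow projection head applied to frozen, pre-trained face descriptors."""
    def __init__(self, in_dim=256, out_dim=64):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return F.normalize(self.fc(x), dim=-1)  # unit-norm embeddings

def mine_pairs(features):
    """Hypothetical mining heuristic: the nearest neighbour of each sample is
    treated as a positive (likely same identity), the farthest as a negative."""
    d = torch.cdist(features, features)                   # pairwise distances
    eye = torch.eye(len(features), dtype=torch.bool)
    pos = d.masked_fill(eye, float('inf')).argmin(dim=1)  # closest non-self
    neg = d.argmax(dim=1)                                 # diagonal is 0, never the max
    return pos, neg

def contrastive_loss(za, zb, label, margin=1.0):
    """label = 1 for positive pairs, 0 for negative pairs."""
    dist = (za - zb).pow(2).sum(dim=1).sqrt()
    return (label * dist.pow(2) +
            (1 - label) * F.relu(margin - dist).pow(2)).mean()

# One training step on a batch of pre-computed face features.
model = SiameseEmbedding()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
feats = torch.randn(128, 256)                 # placeholder pre-trained descriptors
pos, neg = mine_pairs(feats)
z = model(feats)
loss = (contrastive_loss(z, z[pos], torch.ones(len(z))) +
        contrastive_loss(z, z[neg], torch.zeros(len(z)))) / 2
opt.zero_grad(); loss.backward(); opt.step()
```

Because the pairs are mined from the features themselves, no track- or video-level labels are needed, which is what allows the same recipe to run on unordered image collections.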
Cross-modal face matching between the thermal and visible spectrum is a much-desired capability for night-time surveillance and security applications. Due to the very large modality gap, thermal-to-visible face recognition is one of the most challenging face matching problems. In this paper, we present an approach to bridge this modality gap by a significant margin. Our approach captures the highly non-linear relationship between the two modalities by using a deep neural network. Our model attempts to learn a non-linear mapping from the visible to the thermal spectrum while preserving the identity information. We show substantial performance improvement on a difficult thermal-visible face dataset (UND-X1). The presented approach improves the state of the art by more than 10% in terms of Rank-1 identification and bridges the drop in performance due to the modality gap by more than 40%.

The goal of training the deep network is to learn the projections that can be used to bring the two modalities together. Typically, this means regressing the representation from one modality towards the other. We construct a deep network comprising $N + 1$ layers with $m^{(k)}$ units in the $k$-th layer, where $k = 1, 2, \cdots, N$. For an input $x \in \mathbb{R}^{d}$, each layer outputs a non-linear projection by using the learned projection matrix $W$ and the non-linear activation function $g(\cdot)$. The output of the $k$-th hidden layer is

$$h^{(k)} = g\left(W^{(k)} h^{(k-1)} + b^{(k)}\right),$$

where $W^{(k)}$ is the projection matrix to be learned in that layer, $b^{(k)}$ is the bias, and $g(\cdot)$ is the non-linear activation function. Similarly, the output of the top-most hidden layer can be computed as

$$H(x) = h^{(N)} = g\left(W^{(N)} h^{(N-1)} + b^{(N)}\right), \tag{1}$$

where the mapping $H: \mathbb{R}^{d} \rightarrow \mathbb{R}^{m^{(N)}}$ is a parametric non-linear perceptual mapping function learned by the parameters $W$ and $b$ over all the network layers. To determine the parameters $W$ and $b$ for such a mapping, our objective function must seek to minimize the perceptual difference between the visible and thermal training examples in the least-mean-square sense. We therefore formulate the DPM learning as the following optimization problem:

$$\arg\min_{W,\, b} \; \frac{1}{M} \sum_{i=1}^{M} \left\lVert H(x_i) - t_i \right\rVert^{2} + \lambda \sum_{k=1}^{N} \left\lVert W^{(k)} \right\rVert_{F}^{2}. \tag{2}$$

The first term in the objective function corresponds to the simple squared loss between the network output $H(x_i)$ given the visible-domain input $x_i$ and the corresponding training example $t_i$ from the thermal domain. The second term in the objective is the regularization term with $\lambda$ as the regularization parameter, where $\lVert W \rVert_{F}$ is the Frobenius norm of the projection matrix $W$. Given a training set $X = \{x_1, x_2, \cdots, x_M\}$ and $T = \{t_1, t_2, \cdots, t_M\}$ from the visible and thermal domains respectively, the objective of training is to minimize the function in Equation 2 with respect to the parameters $W$ and $b$.

The network is trained on densely computed feature representations (SIFT vectors) from overlapping small regions in the images. This proves effective, as the model is able to capture the perceptual differences of the differing local regions well. The training set comprises these vector...

Table 1: Performance drop due to the modality gap: Rank-1 identification using 1 image/subject as gallery in thermal-thermal and thermal-visible matching using baseline features.
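As a concrete illustration of Equations 1 and 2, the following sketch implements the mapping $H$ and the DPM objective in PyTorch. The layer widths, the choice of tanh for $g(\cdot)$, and the optimizer settings are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch: the DPM non-linear mapping H(x) and the Equation 2 objective.
import torch
import torch.nn as nn

class DPM(nn.Module):
    """N-layer non-linear mapping H: R^d -> R^{m^(N)} that regresses
    visible-domain features towards their thermal-domain counterparts."""
    def __init__(self, dims=(128, 128, 128)):   # (d, m^(1), ..., m^(N)), assumed
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(dims[k], dims[k + 1]) for k in range(len(dims) - 1))

    def forward(self, x):
        for layer in self.layers:
            x = torch.tanh(layer(x))             # h^(k) = g(W^(k) h^(k-1) + b^(k))
        return x

def dpm_objective(model, X, T, lam=1e-4):
    """Equation 2: mean squared perceptual loss plus Frobenius-norm
    regularization over the projection matrices W^(k)."""
    mse = ((model(X) - T) ** 2).sum(dim=1).mean()       # (1/M) sum ||H(x_i) - t_i||^2
    reg = sum((layer.weight ** 2).sum() for layer in model.layers)
    return mse + lam * reg

# Training loop over paired visible (X) and thermal (T) feature vectors.
model = DPM()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
X, T = torch.randn(1024, 128), torch.randn(1024, 128)  # placeholder paired features
for step in range(100):
    loss = dpm_objective(model, X, T)
    opt.zero_grad(); loss.backward(); opt.step()
```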
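For the dense feature extraction step, a sketch along the following lines could produce the training vectors. The grid step, patch size, and the file name `face.png` are hypothetical, and OpenCV's SIFT stands in for whatever implementation the authors used.

```python
# Sketch: dense SIFT descriptors from overlapping small regions on a grid.
import cv2

def dense_sift(gray, step=4, size=8):
    """Compute 128-d SIFT descriptors at keypoints placed on a regular grid,
    so that neighbouring regions overlap (grid step smaller than patch size)."""
    sift = cv2.SIFT_create()
    kps = [cv2.KeyPoint(float(x), float(y), float(size))
           for y in range(size, gray.shape[0] - size, step)
           for x in range(size, gray.shape[1] - size, step)]
    _, desc = sift.compute(gray, kps)
    return desc                                   # (num_patches, 128) float32

img = cv2.imread('face.png', cv2.IMREAD_GRAYSCALE)  # hypothetical input image
descriptors = dense_sift(img)                       # training vectors for the DPM
```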