Training image: 255x255x3 Test image: 255x255x3 17x17x32 49x49x32 Correlation Filter Crop ★ 33x33x1 CNN CNN 49x49x32 Figure 1: Overview of the proposed network architecture, CFNet. It is an asymmetric Siamese network: after applying the same convolutional feature transform to both input images, the "training image" is used to learn a linear template, which is then applied to search the "test image" by cross-correlation.
AbstractThe Correlation Filter is an algorithm that trains a linear template to discriminate between images and their translations. It is well suited to object tracking because its formulation in the Fourier domain provides a fast solution, enabling the detector to be re-trained once per frame. Previous works that use the Correlation Filter, however, have adopted features that were either manually designed or trained for a different task. This work is the first to overcome this limitation by interpreting the Correlation Filter learner, which has a closed-form solution, as a differentiable layer in a deep neural network. This enables learning deep features that are tightly coupled to the Correlation Filter. Experiments illustrate that our method has the important practical benefit of allowing lightweight architectures to achieve state-of-the-art performance at high framerates. * Equal first authorship. is challenging. This problem emerges naturally in applications such as visual object tracking, where the goal is to re-detect an object over a video with the sole supervision of a bounding box at the beginning of the sequence. The main challenge is the lack of a-priori knowledge of the target object, which can be of any class.The simplest approach is to disregard the lack of a-priori knowledge and adapt a pre-trained deep convolutional neural network (CNN) to the target, for example by using stochastic gradient descent (SGD), the workhorse of deep network optimization [32,26,36]. The extremely limited training data and large number of parameters make this a difficult learning problem. Furthermore, SGD is quite expensive for online adaptation [32,26].