Deep Learning of Appearance Models for Online Object Tracking

Zhai, Mengyao; Chen, Lei; Mori, Greg; Roshtkhari, Mehrsan Javan

doi:10.1007/978-3-030-11018-5_57

Cited by 23 publications

(15 citation statements)

References 32 publications


“…Other approaches use covariance matrix representation, pixel comparison representation, SIFT-like features, or pose features [25,83,29,22,50]. Recently, deep neural network architectures have been used for modeling appearance [21,36,84]. In these architectures, high-level features are extracted by convolutional neural networks trained for a specific task.…”

Section: Appearance Model (mentioning)

confidence: 99%

Tracking the Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies

Sadeghian, Alahi, Savarese (2017)

2017 IEEE International Conference on Computer Vision (ICCV)

490

406


The majority of existing solutions to the Multi-Target Tracking (MTT) problem do not combine cues in a coherent end-to-end fashion over a long period of time. However, we present an online method that encodes long-term temporal dependencies across multiple cues. One key challenge of tracking methods is to accurately track occluded targets or those which share similar appearance properties with surrounding objects. To address this challenge, we present a structure of Recurrent Neural Networks (RNN) that jointly reasons on multiple cues over a temporal window. We are able to correct many data association errors and recover observations from an occluded state. We demonstrate the robustness of our data-driven approach by tracking multiple targets using their appearance, motion, and even interactions. Our method outperforms previous works on multiple publicly available datasets including the challenging MOT benchmark.



“…The ImageNet+CF variant employs features taken from a network trained to solve the ImageNet classification challenge [28]. The results show that these features, which are often the first choice for combining CFs with CNNs [7,9,22,26,32,36], are significantly worse than those learned by CFNet and the Baseline experiment. The particularly poor performance of these features at deeper layers is somewhat unsurprising, since these layers are expected to have greater invariance to position when trained for classification.…”

Section: Feature Transfer Experiments (mentioning)

confidence: 99%

“…The simplest approach is to disregard the lack of a-priori knowledge and adapt a pre-trained deep convolutional neural network (CNN) to the target, for example by using stochastic gradient descent (SGD), the workhorse of deep network optimization [32,26,36]. The extremely limited training data and large number of parameters make this a difficult learning problem.…”

Section: Introduction (mentioning)

confidence: 99%

“…This problem emerges naturally in applications such as visual object tracking, where the goal is to re-detect an object over a video with the sole supervision of a bounding box at the beginning of the sequence. The main challenge is the lack of a-priori knowledge of the target object, which can be of any class. The simplest approach is to disregard the lack of a-priori knowledge and adapt a pre-trained deep convolutional neural network (CNN) to the target, for example by using stochastic gradient descent (SGD), the workhorse of deep network optimization [32,26,36]. The extremely limited training data and large number of parameters make this a difficult learning problem.…”

mentioning

confidence: 99%


End-to-End Representation Learning for Correlation Filter Based Tracking

Valmadre, Bertinetto, Henriques, et al. (2017)

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

1,403

1,136


Figure 1: Overview of the proposed network architecture, CFNet (inputs: a 255x255x3 training image and a 255x255x3 test image). It is an asymmetric Siamese network: after applying the same convolutional feature transform to both input images, the "training image" is used to learn a linear template, which is then applied to search the "test image" by cross-correlation.

Abstract: The Correlation Filter is an algorithm that trains a linear template to discriminate between images and their translations. It is well suited to object tracking because its formulation in the Fourier domain provides a fast solution, enabling the detector to be re-trained once per frame. Previous works that use the Correlation Filter, however, have adopted features that were either manually designed or trained for a different task. This work is the first to overcome this limitation by interpreting the Correlation Filter learner, which has a closed-form solution, as a differentiable layer in a deep neural network. This enables learning deep features that are tightly coupled to the Correlation Filter. Experiments illustrate that our method has the important practical benefit of allowing lightweight architectures to achieve state-of-the-art performance at high framerates.

* Equal first authorship.

… is challenging. This problem emerges naturally in applications such as visual object tracking, where the goal is to re-detect an object over a video with the sole supervision of a bounding box at the beginning of the sequence. The main challenge is the lack of a-priori knowledge of the target object, which can be of any class. The simplest approach is to disregard the lack of a-priori knowledge and adapt a pre-trained deep convolutional neural network (CNN) to the target, for example by using stochastic gradient descent (SGD), the workhorse of deep network optimization [32,26,36]. The extremely limited training data and large number of parameters make this a difficult learning problem. Furthermore, SGD is quite expensive for online adaptation [32,26].
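To make the "closed-form solution in the Fourier domain" mentioned in the abstract above concrete, here is a minimal sketch of a single-channel correlation filter, assuming NumPy is available. It is not CFNet and not the authors' code: it omits the learned CNN features, cosine windowing, and multi-channel handling, and the function names (`gaussian_label`, `train_filter`, `estimate_shift`) and the regularizer value `lam` are hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Desired response: a Gaussian peak at the patch centre."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def train_filter(x, y, lam=1e-2):
    """Ridge regression over all cyclic shifts of x, solved per frequency.

    Returns the filter in the Fourier domain. `lam` is an assumed
    regularisation strength, not a value taken from the paper.
    """
    X = np.fft.fft2(x)
    Y = np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)

def estimate_shift(W, z):
    """Apply the filter to a search patch z and return the offset of the
    response peak from the patch centre, as (rows, cols)."""
    response = np.real(np.fft.ifft2(W * np.fft.fft2(z)))
    py, px = np.unravel_index(np.argmax(response), response.shape)
    h, w = response.shape
    return int(py - h // 2), int(px - w // 2)

# Toy usage: train on a random patch, then search a cyclically shifted copy.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
W = train_filter(x, gaussian_label(x.shape))
z = np.roll(x, shift=(3, -5), axis=(0, 1))
shift = estimate_shift(W, z)
```

Because training reduces to element-wise operations on FFTs, re-training the filter every frame is cheap, which is the property the abstract highlights. In this toy setting the shift is exactly cyclic; real frames are not, which is one reason practical trackers add windowing.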


“…Many approaches rely on appearance [17,29,57,59,11,33,47], motion [13], or social cues [20,44]. They are mostly used to associate pairs of detections, and only account for very short-term correlations.…”

Section: Related Work (mentioning)

confidence: 99%

Eliminating Exposure Bias and Metric Mismatch in Multiple Object Tracking

Maksai, Fua (2019)

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)


Identity Switching remains one of the main difficulties Multiple Object Tracking (MOT) algorithms have to deal with. Many state-of-the-art approaches now use sequence models to solve this problem but their training can be affected by biases that decrease their efficiency. In this paper, we introduce a new training procedure that confronts the algorithm to its own mistakes while explicitly attempting to minimize the number of switches, which results in better training. We propose an iterative scheme of building a rich training set and using it to learn a scoring function that is an explicit proxy for the target tracking metric. Whether using only simple geometric features or more sophisticated ones that also take appearance into account, our approach outperforms the state-of-the-art on several MOT benchmarks.
