TransCenter: Transformers with Dense Representations for Multiple-Object Tracking
Preprint, 2021
DOI: 10.48550/arxiv.2103.15145

Abstract: Transformer networks have proven extremely powerful for a wide variety of tasks since they were introduced. Computer vision is not an exception, as the use of transformers has become very popular in the vision community in recent years. Despite this wave, multiple-object tracking (MOT) exhibits for now some sort of incompatibility with transformers. We argue that the standard representation -bounding boxes -is not adapted to learning transformers for MOT. Inspired by recent research, we propose TransCenter, t…

Cited by 24 publications (33 citation statements)
References 50 publications (131 reference statements)
“…In order to apply the Transformer model, DETR [5] treats object detection as a set prediction problem. Transformers are also adopted for super-resolution in [55], image colorization in [21], tracking in [8,54,58], pose estimation in [29], etc. Besides, for video understanding, there are also recent approaches that seek to resolve this challenge using Transformer networks.…”
Section: Transformers in Computer Vision
confidence: 99%
“…The breakthroughs of the Transformer networks [60] in natural language processing (NLP) domain have sparked the interest of the computer vision community in developing vision transformers for different computer vision tasks, such as image classification [10,40], object detection [4,63,6,40], image segmentation [96,54,63,40], object tracking [80,81], pose estimation [42,58], etc. Among them, DPT [54] adopts a U-shape structure and uses ViT [10] as an encoder to perform semantic segmentation and monocular depth estimation.…”
Section: Vision Transformers
confidence: 99%
“…TBC [11] explicitly accounts for the object counts inferred from density maps and simultaneously solves detection and tracking. TransCenter [12] is a transformer-based architecture that handles long-term complex dependencies by using an attention mechanism. However, these methods are limited in terms of the degree to which speed can be increased without losing accuracy, because there is a trade-off between speed and accuracy.…”
Section: Tracking Based on Detection
confidence: 99%
“…However, human detection and feature extraction take a lot of time; hence, rich computational resources are required for real-time tracking. Some methods tackle this problem by performing human detection and feature extraction simultaneously with a single deep learning model [7,8,9,10,11,12]. However, there is a limit on the degree to which speed can be increased without losing accuracy.…”
Section: Introduction
confidence: 99%