2021
DOI: 10.48550/arxiv.2104.01745
Preprint

A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification

Xuehu Liu,
Pingping Zhang,
Chenyang Yu
et al.

Abstract: Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras. Previous methods usually focus on limited views, such as the spatial, temporal, or spatial-temporal view, and lack observations from different feature domains. To capture richer perceptions and extract more comprehensive video representations, in this paper we propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID. More specifically, we design a tr…

Cited by 8 publications (13 citation statements)
References 33 publications
“…For example, [27,35,48] integrate Transformer layers into the CNN backbone to aggregate hierarchical features and align local features. For video ReID, [28,49] exploit Transformer to aggregate appearance features, spatial features, and temporal features to learn a discriminative representation for a person tracklet.…”
Section: Transformer-based ReID
confidence: 99%
“…The MCA measures the correlation among cross-hypothesis features and has a similar structure to MSA. The common configuration of MCA uses the same input between keys and values [3,25,42]. However, an issue with this configuration is that it will result in more blocks (e.g., 2M MCA blocks for M hypotheses).…”
Section: Cross-hypothesis Interaction
confidence: 99%
“…The common configuration of MCA uses the same input between keys and values [3,25,42], i.e., the inputs x = y = z. Instead, we adopt a more efficient strategy by using different inputs, i.e., x ≠ y ≠ z.…”
Section: Supplementary Materials
confidence: 99%
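The distinction drawn in the statement above, keys and values sharing one input versus drawing from different streams, can be sketched with a minimal scaled dot-product cross-attention in plain Python. This is an illustrative toy (the function names, shapes, and data are assumptions, not the cited papers' implementations):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention over lists of small float vectors.

    Returns one output vector per query: a softmax-weighted mix of `values`,
    weighted by query-key similarity.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Two hypothetical hypothesis feature streams (toy 2-D embeddings).
hyp_a = [[1.0, 0.0], [0.0, 1.0]]
hyp_b = [[0.5, 0.5], [1.0, -1.0]]

# Common MCA configuration: keys and values share the same input (y = z).
shared = cross_attention(hyp_a, hyp_b, hyp_b)

# Alternative configuration: keys and values come from different streams.
distinct = cross_attention(hyp_a, hyp_b, hyp_a)
```

With keys and values tied to one stream, every pair of hypotheses needs its own block in each direction, which is the 2M-blocks issue the statement mentions; decoupling the value stream is what the cited supplementary material exploits to reduce that count.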
“…For video-based person Re-ID, Liu et al [25] design a trigeminal network to transform video data into spatial, temporal and spatial-temporal feature spaces. Zhang et al [48] design perceptionconstrained Transformers to decrease the risk of overfitting.…”
Section: Transformer in Vision
confidence: 99%
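The "trigeminal" idea cited above, projecting one video into spatial, temporal, and spatial-temporal feature spaces, can be illustrated with simple pooling over a toy frames-by-positions feature grid. This is a hedged sketch of the general concept, not TMT's actual architecture; the shapes and pooling choices here are assumptions:

```python
def three_views(video_feats):
    """Derive three views from a T x N grid of per-frame, per-position features.

    Toy stand-in for a trigeminal split (hypothetical design):
    - spatial view: average over frames, one feature per spatial position;
    - temporal view: average over positions, one feature per frame;
    - spatial-temporal view: the full grid, flattened.
    """
    T, N = len(video_feats), len(video_feats[0])
    spatial = [sum(video_feats[t][n] for t in range(T)) / T for n in range(N)]
    temporal = [sum(video_feats[t]) / N for t in range(T)]
    spatial_temporal = [x for frame in video_feats for x in frame]
    return spatial, temporal, spatial_temporal

feats = [[1.0, 2.0, 3.0],   # frame 0, three spatial positions
         [3.0, 2.0, 1.0]]   # frame 1
s, t, st = three_views(feats)
# s  → [2.0, 2.0, 2.0]  (per-position, time-averaged)
# t  → [2.0, 2.0]       (per-frame, space-averaged)
# st → [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
```

Each view then feeds its own branch, so the model observes the same tracklet in three complementary feature domains before the representations are fused.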