2022
DOI: 10.1007/978-3-031-19781-9_32
CAViT: Contextual Alignment Vision Transformer for Video Object Re-identification

Cited by 7 publications
(14 citation statements)
References 44 publications
“…Specifically, the proposed method also outperformed TCLNet [18] and BiCnet-TKS [42], which use similar diverse attention-based methods, with improvements of up to 1.4%/2.2% and 1.2%/1.8% mAP/Rank-1 accuracy on MARS, respectively. Further, ST-MGA outperformed several recent models (i.e., SINet [37], CAViT [38], HMN [40], SGMN [41], and BIC+LGCN [42]). In particular, the proposed method shows higher accuracy than the complex transformer-based method [38], which has recently attracted attention.…”
Section: The Influence Of Granularity
confidence: 87%
“…The above results verify the effectiveness and superiority of ST-MGA in video ReID.…”
Section: The Influence Of Granularity
confidence: 87%
“…The first is the one-stage method (Liu et al 2021a; Yang et al 2020; Yan et al 2020; He et al 2021b; Gu et al 2020), which utilizes 3D convolution or graph neural networks to learn spatial-temporal information from videos. As mentioned in (Wu et al 2022), 3D convolution-based methods are often affected by misalignment of adjacent frames and by the occlusion problem. Furthermore, graph neural networks (Liu et al 2021a) usually require an additional pose estimation network to model the body relationships of the target person across frames.…”
Section: Introduction
confidence: 99%