2022
DOI: 10.1007/978-3-031-19781-9_32
CAViT: Contextual Alignment Vision Transformer for Video Object Re-identification

Cited by 7 publications
(14 citation statements)
References 44 publications
“…Specifically, the proposed method also outperformed TCLNet [18] and BiCnet-TKS [42], which use similar diverse attention-based methods, with improvements of up to 1.4%/2.2% and 1.2%/1.8% mAP/Rank-1 accuracy on MARS, respectively. Further, ST-MGA outperformed several recent models (i.e., SINet [37], CAViT [38], HMN [40], SGMN [41], and BIC+LGCN [42]). In particular, the proposed method shows higher accuracy than the complex transformer-based method [38], which has recently attracted attention.…”
Section: The Influence Of Granularity
confidence: 87%
“…The above results verify the effectiveness and superiority of ST-MGA in video ReID.…”
Section: The Influence Of Granularity
confidence: 87%
“…The first is the one-stage method (Liu et al 2021a; Yang et al 2020; Yan et al 2020; He et al 2021b; Gu et al 2020), which utilizes 3D convolution or graph neural networks to learn spatial-temporal information from videos. As mentioned in (Wu et al 2022), 3D convolution-based methods are often affected by misalignment of adjacent frames and by the occlusion problem. Furthermore, graph neural networks (Liu et al 2021a) usually require an additional pose estimation network to model the body relationships of the target person across frames.…”
Section: Introduction
confidence: 99%