2021
DOI: 10.48550/arxiv.2104.01745
Preprint

A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification

Xuehu Liu,
Pingping Zhang,
Chenyang Yu
et al.

Abstract: Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras. Previous methods usually focus on limited views, such as the spatial, temporal, or spatial-temporal view, and lack observations from different feature domains. To capture richer perceptions and extract more comprehensive video representations, in this paper we propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID. More specifically, we design a tr…

Cited by 8 publications (13 citation statements)
References 33 publications
“…For example, [27,35,48] integrate Transformer layers into the CNN backbone to aggregate hierarchical features and align local features. For video ReID, [28,49] exploit Transformer to aggregate appearance features, spatial features, and temporal features to learn a discriminative representation for a person tracklet.…”
Section: Transformer-based ReID
confidence: 99%
“…The MCA measures the correlation among cross-hypothesis features and has a similar structure to MSA. The common configuration of MCA uses the same input between keys and values [3,25,42]. However, an issue with this configuration is that it will result in more blocks (e.g., 2M MCA blocks for M hypotheses).…”
Section: Cross-hypothesis Interaction
confidence: 99%
“…The common configuration of MCA uses the same input between keys and values [3,25,42], i.e., the inputs x = y = z. Instead, we adopt a more efficient strategy by using different inputs, i.e., x ≠ y ≠ z.…”
Section: Supplementary Materials
confidence: 99%
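The distinction drawn in the statement above, keys and values sharing one input versus drawing from different streams, can be sketched with a minimal scaled dot-product cross-attention in plain Python. This is an illustrative toy (the function names, shapes, and data are assumptions, not the cited papers' implementations):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention over lists of small float vectors.

    Returns one output vector per query: a softmax-weighted mix of `values`,
    weighted by query-key similarity.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Two hypothetical hypothesis feature streams (toy 2-D embeddings).
hyp_a = [[1.0, 0.0], [0.0, 1.0]]
hyp_b = [[0.5, 0.5], [1.0, -1.0]]

# Common MCA configuration: keys and values share the same input (y = z).
shared = cross_attention(hyp_a, hyp_b, hyp_b)

# Alternative configuration: keys and values come from different streams.
distinct = cross_attention(hyp_a, hyp_b, hyp_a)
```

With keys and values tied to one stream, every pair of hypotheses needs its own block in each direction, which is the 2M-blocks issue the statement mentions; decoupling the value stream is what the cited supplementary material exploits to reduce that count.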
“…For video-based person Re-ID, Liu et al [25] design a trigeminal network to transform video data into spatial, temporal and spatial-temporal feature spaces. Zhang et al [48] design perceptionconstrained Transformers to decrease the risk of overfitting.…”
Section: Transformer in Vision
confidence: 99%
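The "trigeminal" idea cited above, projecting one video into spatial, temporal, and spatial-temporal feature spaces, can be illustrated with simple pooling over a toy frames-by-positions feature grid. This is a hedged sketch of the general concept, not TMT's actual architecture; the shapes and pooling choices here are assumptions:

```python
def three_views(video_feats):
    """Derive three views from a T x N grid of per-frame, per-position features.

    Toy stand-in for a trigeminal split (hypothetical design):
    - spatial view: average over frames, one feature per spatial position;
    - temporal view: average over positions, one feature per frame;
    - spatial-temporal view: the full grid, flattened.
    """
    T, N = len(video_feats), len(video_feats[0])
    spatial = [sum(video_feats[t][n] for t in range(T)) / T for n in range(N)]
    temporal = [sum(video_feats[t]) / N for t in range(T)]
    spatial_temporal = [x for frame in video_feats for x in frame]
    return spatial, temporal, spatial_temporal

feats = [[1.0, 2.0, 3.0],   # frame 0, three spatial positions
         [3.0, 2.0, 1.0]]   # frame 1
s, t, st = three_views(feats)
# s  → [2.0, 2.0, 2.0]  (per-position, time-averaged)
# t  → [2.0, 2.0]       (per-frame, space-averaged)
# st → [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
```

Each view then feeds its own branch, so the model observes the same tracklet in three complementary feature domains before the representations are fused.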