2023 · DOI: 10.1016/j.patrec.2023.02.023
Transformer vision-language tracking via proxy token guided cross-modal fusion


Cited by 7 publications (3 citation statements, published 2023–2024)
References 4 publications
“…As a new topic in computer vision, vision-language tracking has attracted considerable attention from researchers in recent years [7, 12–16, 22], alongside the rapid development of natural language processing. Li [22] was the first to apply the fusion of vision-language features in a tracking task.…”
Section: Vision-Language Object Tracking (mentioning)
Confidence: 99%
“…Guo [44] proposed a ModaMixer and asymmetrical networks to learn a unified-adaptive vision-language representation. Zhao [7] presented a transformer-based tracking network that uses a proxy token to guide cross-modal attention: the proxy token modulates the word embeddings and makes them attend to visual features.…”
Section: Vision-Language Object Tracking (mentioning)
Confidence: 99%
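
The statement above describes the mechanism only loosely. Below is a minimal sketch of what proxy-token guided cross-modal fusion could look like, assuming a single learnable proxy token, sigmoid gating of the word embeddings, and standard multi-head cross-attention; the class name ProxyTokenFusion, the dimensions, and the gating form are illustrative assumptions, not the architecture from Zhao [7].

```python
# Sketch of proxy-token guided cross-modal fusion (illustrative, not the
# paper's implementation): a learnable proxy token gates the word
# embeddings, and the gated words then attend to visual features.
import torch
import torch.nn as nn

class ProxyTokenFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        # Single learnable proxy token, shared across the batch (assumption).
        self.proxy = nn.Parameter(torch.zeros(1, 1, d_model))
        # Projection that turns the proxy token into a modulation signal.
        self.modulate = nn.Linear(d_model, d_model)
        # Cross-modal attention: language queries, visual keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, word_emb, vis_feat):
        # word_emb: (B, L, d_model) language features
        # vis_feat: (B, N, d_model) flattened visual features
        proxy = self.proxy.expand(word_emb.size(0), -1, -1)   # (B, 1, d_model)
        # Modulate word embeddings with a sigmoid gate derived from the proxy.
        gate = torch.sigmoid(self.modulate(proxy))            # (B, 1, d_model)
        words = word_emb * gate                               # broadcast over L
        # Modulated words attend to the visual features.
        fused, _ = self.cross_attn(query=words, key=vis_feat, value=vis_feat)
        return fused                                          # (B, L, d_model)

# Usage with dummy tensors: 12 words, 400 visual tokens, 256-d features.
fusion = ProxyTokenFusion()
out = fusion(torch.randn(2, 12, 256), torch.randn(2, 400, 256))
print(out.shape)  # torch.Size([2, 12, 256])
```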