2023 · DOI: 10.1016/j.patrec.2023.02.023
Transformer vision-language tracking via proxy token guided cross-modal fusion


Cited by 7 publications (3 citation statements, published 2023–2024)
References 4 publications
“…As a new topic in computer vision, vision-language tracking has attracted considerable attention from researchers in recent years [7, 12–16, 22], alongside the rapid development of natural language processing. Li [22] was the first to apply the fusion of vision-language features in a tracking task.…”
Section: Vision-Language Object Tracking (mentioning)
Confidence: 99%
“…Guo [44] proposed a ModaMixer and asymmetrical networks to learn a unified-adaptive vision-language representation. Zhao [7] presented a transformer-based tracking network that uses a proxy token to guide cross-modal attention: the proxy token modulates the word embeddings and makes them attend to visual features.…”
Section: Vision-Language Object Tracking (mentioning)
Confidence: 99%
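
The statement above describes the mechanism only loosely. Below is a minimal sketch of what proxy-token guided cross-modal fusion could look like, assuming a single learnable proxy token, sigmoid gating of the word embeddings, and standard multi-head cross-attention; the class name ProxyTokenFusion, the dimensions, and the gating form are illustrative assumptions, not the architecture from Zhao [7].

```python
# Sketch of proxy-token guided cross-modal fusion (illustrative, not the
# paper's implementation): a learnable proxy token gates the word
# embeddings, and the gated words then attend to visual features.
import torch
import torch.nn as nn

class ProxyTokenFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        # Single learnable proxy token, shared across the batch (assumption).
        self.proxy = nn.Parameter(torch.zeros(1, 1, d_model))
        # Projection that turns the proxy token into a modulation signal.
        self.modulate = nn.Linear(d_model, d_model)
        # Cross-modal attention: language queries, visual keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, word_emb, vis_feat):
        # word_emb: (B, L, d_model) language features
        # vis_feat: (B, N, d_model) flattened visual features
        proxy = self.proxy.expand(word_emb.size(0), -1, -1)   # (B, 1, d_model)
        # Modulate word embeddings with a sigmoid gate derived from the proxy.
        gate = torch.sigmoid(self.modulate(proxy))            # (B, 1, d_model)
        words = word_emb * gate                               # broadcast over L
        # Modulated words attend to the visual features.
        fused, _ = self.cross_attn(query=words, key=vis_feat, value=vis_feat)
        return fused                                          # (B, L, d_model)

# Usage with dummy tensors: 12 words, 400 visual tokens, 256-d features.
fusion = ProxyTokenFusion()
out = fusion(torch.randn(2, 12, 256), torch.randn(2, 400, 256))
print(out.shape)  # torch.Size([2, 12, 256])
```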