2023
DOI: 10.1016/j.inffus.2023.101881
Exploring fusion strategies for accurate RGBT visual object tracking


Cited by 19 publications (2 citation statements)
References 59 publications
“…As a result, recent research has been dominated by deep learning techniques. Recent studies have explored a wide range of fusion strategies, from the simplest operation, concatenation (Zhang et al. 2019), to more complex transformer architectures (Hui et al. 2023; Zhu et al. 2023a), with various intentions: learning modality importance (Zhang et al. 2021b; Tang, Xu, and Wu 2022), reducing multi-modal redundancy (Li et al. 2018; Zhu et al. 2019), propagating multi-modal patterns (Wang et al. 2020), and learning multi-modal prompts from the auxiliary modality (Zhu et al. 2023a), to name a few. With increasing network complexity and the availability of larger training sets, tracking results have gradually improved.…”
Section: RGB-T Trackers (mentioning)
Confidence: 99%
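To make the fusion strategies enumerated in this citation statement concrete, the minimal PyTorch-style sketch below contrasts the two ends of the range it describes: plain channel-wise concatenation and a learned modality-importance weighting. All module names, tensor shapes, and design details are illustrative assumptions and do not reproduce the implementation of the cited paper or of any specific tracker.

```python
# Illustrative sketch (not from the cited paper): two simple RGB-T feature
# fusion strategies — channel-wise concatenation and learned modality weights.
import torch
import torch.nn as nn


class ConcatFusion(nn.Module):
    """Simplest strategy: concatenate RGB and TIR feature maps along channels."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv projects the concatenated features back to `channels`
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, tir_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([rgb_feat, tir_feat], dim=1))


class ModalityWeightFusion(nn.Module):
    """Learns per-modality importance weights from globally pooled features."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),            # global context
            nn.Conv2d(channels, 2, kernel_size=1),
            nn.Flatten(),
            nn.Softmax(dim=1),                  # weights for (RGB, TIR) sum to 1
        )

    def forward(self, rgb_feat: torch.Tensor, tir_feat: torch.Tensor) -> torch.Tensor:
        w = self.gate(rgb_feat + tir_feat)      # (B, 2) modality weights
        w_rgb = w[:, 0].view(-1, 1, 1, 1)
        w_tir = w[:, 1].view(-1, 1, 1, 1)
        return w_rgb * rgb_feat + w_tir * tir_feat


if __name__ == "__main__":
    rgb = torch.randn(2, 256, 16, 16)   # dummy RGB backbone features
    tir = torch.randn(2, 256, 16, 16)   # dummy thermal backbone features
    print(ConcatFusion(256)(rgb, tir).shape)           # torch.Size([2, 256, 16, 16])
    print(ModalityWeightFusion(256)(rgb, tir).shape)   # torch.Size([2, 256, 16, 16])
```

Transformer-based fusion or prompt learning, as cited above, would replace these modules with cross-attention blocks or learnable prompt tokens, but the interface is the same: map a pair of (RGB, TIR) feature maps to a single fused representation.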
“…Due to the strict robustness requirements on tracking systems in real-world applications such as surveillance (Lu et al. 2023) and unmanned driving (Zhang et al. 2023a), visual object tracking with an auxiliary modality, termed multi-modal tracking, has recently drawn growing attention. For example, the thermal infrared (TIR) modality provides more stable scene perception at night (Tang et al. 2023), and the depth (D) modality provides 3-D perception that helps against occlusions (Zhu et al. 2023b). In other words, auxiliary modalities can complement the visible image in challenging scenarios.…”
Section: Introduction (mentioning)
Confidence: 99%