2022
DOI: 10.1109/tcsvt.2021.3072207

SiamCDA: Complementarity- and Distractor-Aware RGB-T Tracking Based on Siamese Network

Cited by 62 publications (20 citation statements)
References 54 publications
“…Zhu et al. [7] proposed a trident architecture to integrate the fused-modality features with the two modality-specific features, thus achieving a robust target representation. Zhang et al. [33] introduced a complementary perception module for multi-modal feature fusion, which reduces the discrepancy between single-modal features to enhance the discriminability of the fused features. This design fully exploits the training data and addresses challenges such as illumination variation, occlusion, thermal crossover, and fast motion.…”
Section: Related Work
mentioning confidence: 99%
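
The trident idea keeps the modality-specific cues alive alongside the fused representation. A minimal PyTorch sketch of such a three-branch fusion follows; the module name `TridentFusion` and its layer choices are hypothetical illustrations of the idea, not the exact architecture from [7]:

```python
import torch
import torch.nn as nn

class TridentFusion(nn.Module):
    """Hypothetical three-branch fusion: a fused branch plus the two
    modality-specific branches, aggregated into one representation."""

    def __init__(self, channels: int):
        super().__init__()
        # Fused branch: concatenate RGB and thermal, project back to `channels`.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        # Aggregate the fused branch and the two modality-specific branches.
        self.aggregate = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, f_rgb: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        f_fused = self.fuse(torch.cat([f_rgb, f_t], dim=1))
        return self.aggregate(torch.cat([f_fused, f_rgb, f_t], dim=1))

# Usage on dummy backbone features:
out = TridentFusion(256)(torch.randn(1, 256, 31, 31), torch.randn(1, 256, 31, 31))
```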
“…Gao et al. [33] weighted the modalities so that the network focuses on the more informative one, effectively integrating the two modalities. Zhang et al. [34] took the feature maps from two-stream Siamese networks as inputs and weighted them through a weight-generation sub-network to capture the complementary information between modalities. The enhanced features were then obtained via cross-modal residual connections and finally concatenated.…”
Section: Related Work
mentioning confidence: 99%
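
This weight-then-enhance pipeline can be sketched in a few lines of PyTorch. Everything below (the module name, the pooling-plus-MLP weight generator, the scalar per-modality weights) is an illustrative assumption consistent with the description above, not the exact design from [34]:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical sketch: a weight-generation sub-network predicts one
    scalar per modality; cross-modal residual connections inject each
    weighted feature into the other modality; the enhanced features are
    concatenated along the channel axis."""

    def __init__(self, channels: int):
        super().__init__()
        # Weight generator: global pooling + small MLP -> two weights in (0, 1).
        self.weight_net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(2 * channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, 2),
            nn.Sigmoid(),
        )

    def forward(self, f_rgb: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        w = self.weight_net(torch.cat([f_rgb, f_t], dim=1))      # (B, 2)
        w_rgb = w[:, 0].view(-1, 1, 1, 1)
        w_t = w[:, 1].view(-1, 1, 1, 1)
        # Cross-modal residual connections: each modality is enhanced
        # by the weighted feature of the other.
        enh_rgb = f_rgb + w_t * f_t
        enh_t = f_t + w_rgb * f_rgb
        return torch.cat([enh_rgb, enh_t], dim=1)

# Usage on dummy backbone features:
fused = CrossModalFusion(256)(torch.randn(1, 256, 31, 31), torch.randn(1, 256, 31, 31))
```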
“…DeT [47] adds a depth feature extraction branch to the original ATOM [7] or DiMP [3] tracker and fine-tunes it on RGB-D training data. Zhang et al. [57] extend SiamRPN++ [21] with dual-modal inputs for RGB-T tracking. They first construct a unimodal tracking network trained on RGB data, then tune the whole extended multi-modal network on RGB-T image pairs.…”
Section: Introduction
mentioning confidence: 99%
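
The two-stage recipe (pretrain a unimodal network on RGB, then fine-tune the dual-modal extension on RGB-T pairs) is commonly realized by duplicating the pretrained branch to initialize the second modality. The sketch below is a hypothetical minimal version with a stand-in backbone; `extend_to_rgbt` is an assumed helper, not code from the cited trackers:

```python
import copy
import torch
import torch.nn as nn

def extend_to_rgbt(rgb_backbone: nn.Module) -> nn.ModuleDict:
    """Hypothetical stage-2 extension: clone the RGB-pretrained backbone
    to initialize a thermal branch, yielding a dual-input network that is
    then fine-tuned end-to-end on RGB-T image pairs."""
    return nn.ModuleDict({
        "rgb": rgb_backbone,                     # stage-1 weights, trained on RGB data
        "thermal": copy.deepcopy(rgb_backbone),  # thermal branch starts from RGB weights
    })

# Minimal usage with a toy stand-in backbone (SiamRPN++ uses a ResNet).
rgb_backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
net = extend_to_rgbt(rgb_backbone)
rgb, thermal = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
f = torch.cat([net["rgb"](rgb), net["thermal"](thermal)], dim=1)  # features for the tracking head
```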