HRTransNet: HRFormer-Driven Two-Modality Salient Object Detection

Tang, Bin; Liu, Zhengyi; Tan, Yacheng; He, Qian

doi:10.1109/tcsvt.2022.3202563

Cited by 31 publications

(4 citation statements)

References 130 publications


“…Red and blue denote the best and the second-best results, respectively. HDFNet [29], CoNet [19], BBS-Net [13], JL-DCF-R [27], SPNet [113], CMINet [114], DCF [115], as well as four state-of-the-art transformer-based RGB-D SOD models, namely SwinNet [85], HRTransNet [86], EBMGSOD [84], and our previous VST [44], for comparison. Table 4 and Table 5 report the comparison results.…”

Section: Comparison With State-of-the-art Methods (mentioning)

confidence: 99%

“…Following VST, we leverage the pre-trained T2T-ViT_t-14 model [46] as our backbone to create the VST-t++ model. Moreover, some transformer-based models have been proposed for RGB SOD [83,84] and RGB-D SOD [84,85,86] with the Swin Transformer family [49] as the backbone. Following this trend, we explore three Swin Transformer models with different scales, i.e.…”

Section: Comparison With State-of-the-art Methods (mentioning)

confidence: 99%

“…For RGB-D SOD, SwinNet [85] successfully unified RGB-D SOD and RGB-T SOD, aligning spatial and re-calibrating channels with the help of hierarchical features generated by the Swin Transformer [49]. HRTransNet [86] also unified these two tasks and applied a high-resolution network [87] to SOD.…”

Section: Transformer-based SOD (mentioning)

confidence: 99%


Visual Saliency Transformer

Liu, Zhang, Wan, et al. 2021

2021 IEEE/CVF International Conference on Computer Vision (ICCV)


While previous CNN-based models have exhibited promising results for salient object detection (SOD), their ability to explore global long-range dependencies is restricted. Our previous work, the Visual Saliency Transformer (VST), addressed this constraint from a transformer-based sequence-to-sequence perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task transformer decoder that concurrently predicts saliency and boundary outcomes in a pure transformer architecture. Moreover, we introduced a novel token upsampling method called reverse T2T for predicting a high-resolution saliency map effortlessly within transformer-based structures. Building upon the VST model, we further propose an efficient and stronger VST version in this work, i.e., VST++. To mitigate the computational costs of the VST model, we propose a Select-Integrate Attention (SIA) module, partitioning foreground into fine-grained segments and aggregating background information into a single coarse-grained token. To incorporate 3D depth information with low cost, we design a novel depth position encoding method tailored for depth maps. Furthermore, we introduce a token-supervised prediction loss to provide straightforward guidance for the task-related tokens. We evaluate our VST++ model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets. Experimental results show that our model outperforms existing methods while achieving a 25% reduction in computational costs without significant performance compromise. The demonstrated strong ability for generalization, enhanced performance, and heightened efficiency of our VST++ model highlight its potential.


“…DCMNet [68] and HRTransNet [69]. To ensure the fairness of the comparison, we used the saliency maps provided by the authors.…”

Section: Comparison With State-of-the-art Methods (mentioning)

confidence: 99%

RGB-D Salient Object Detection Based on Cross-Modal and Cross-Level Feature Fusion

Peng, Zhai, Feng 2024

IEEE Access


Existing RGB-D saliency detection models have not fully considered the differences between features at various levels, and lack an effective mechanism for cross-level feature fusion. This article proposes a novel cross-modality cross-level fusion learning framework. The framework mainly contains three modules: Attention Enhancement Module (AEM), Modality Feature Fusion Module (MFM), and Graph Reasoning Module (GRM). AEM is used to enhance the features of the two modalities. MFM is used to integrate the features of the two modalities to achieve cross-modality feature fusion. Subsequently, the modality fusion features are divided into high-level features and low-level features. The high-level features contain the semantic localization information of salient objects, and the low-level features contain the detailed information of salient objects. GRM extends the semantic localization information of salient objects in the high-level features from pixel features to the entire salient object area, thereby achieving cross-level feature fusion. This framework can effectively eliminate background noise and enhance the model's expressiveness. Extensive experiments were conducted on seven widely used datasets, and the results show that the new method outperforms nine current state-of-the-art RGB-D SOD methods.
