Transformers and CNNs fusion network for salient object detection

Yao, Cuili; Feng, Lin; Kong, Yuqiu; Xiao, Lin; Chen, Tao

doi:10.1016/j.neucom.2022.10.081

Cited by 12 publications

(5 citation statements)

References 29 publications

“…$Y_{\mathrm{Transformer}} = \mathrm{flatten}(\mathrm{Transformer}_{\mathrm{feature\_shape}})$ (18) Then, the $Y_{\mathrm{Transformer}}$ is passed onto the deep neural network classifier. Moreover, the required optimum parameters for the transformer feature extractor of the Conv-ViT network are summarized in Table 1.…”

Section: Vision Transformer (mentioning)

confidence: 99%

“…This ViT-ARN was trained using a total of 858 and 1600 videos from two datasets and was evaluated based on two datasets-LAD-2000 and UCF-Crime datasets where the proposed framework outperformed other state-of-the-art approaches with an increased accuracy of 10.14% and 3% in these two datasets, respectively. In a separate study, Yao et al [18] proposed a fusion of transformers and CNN for salient object detection (SOD) where the transformer captured the long-distance pixel relationship, and later, a CNN was applied, which extracted the fine-grained local details. This incorporation resolved the problem of using a CNN-based network and showed equal effectivity for both RGB and RGB-D (RGB and depth) SOD.…”

Section: Introduction (mentioning)

confidence: 99%

“…In this way, the extracted feature is changed by the latter model, which is why the extracted feature in every model does not have the same significance in the final classification. Concerning these findings [12][13][14][15][16][17][18][19][20][21][22], instead of ensembling or stacking the models, this work proposes a hybrid feature extraction method by fusing conventional pre-trained CNN models such as Inception-V3 and ResNet-50 and a transformer model. In this framework, the individual model extracts the feature individually, and later, the extracted features become fused.…”

Section: Introduction (mentioning)

confidence: 99%
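
The hybrid extraction described in the statement above (each model extracts features on its own, and the features are fused afterwards) can be illustrated with a minimal sketch. It assumes that fusion means flattening each feature map and concatenating the results; the citing text only says the features "become fused", so the concatenation step, the function names, and the toy values are all hypothetical stand-ins rather than the authors' implementation.

```python
from typing import Any, List


def flatten(feature: Any) -> List[float]:
    """Flatten an arbitrarily nested feature map into a 1-D list of floats."""
    if isinstance(feature, (int, float)):
        return [float(feature)]
    flat: List[float] = []
    for item in feature:
        flat.extend(flatten(item))
    return flat


def fuse_features(*feature_maps: Any) -> List[float]:
    """Flatten each model's feature map separately, then concatenate them.

    Concatenation is an assumed fusion strategy chosen for illustration.
    """
    fused: List[float] = []
    for fmap in feature_maps:
        fused.extend(flatten(fmap))
    return fused


# Hypothetical toy outputs standing in for the three extractors.
inception_features = [[0.1, 0.2], [0.3, 0.4]]  # 2x2 CNN feature map
resnet_features = [[0.5, 0.6]]                 # 1x2 CNN feature map
transformer_features = [[[0.7], [0.8]]]        # 1x2x1 token embeddings

fused = fuse_features(inception_features, resnet_features, transformer_features)
print(len(fused))  # one fused vector that a downstream classifier would consume
```

Keeping the extractors independent until the final concatenation means no model's output is altered by another, which matches the motivation given in the statement for avoiding stacked models.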

Conv-ViT: A Convolution and Vision Transformer-Based Hybrid Feature Extraction Method for Retinal Disease Detection

et al. 2023

The current advancement towards retinal disease detection mainly focused on distinct feature extraction using either a convolutional neural network (CNN) or a transformer-based end-to-end deep learning (DL) model. The individual end-to-end DL models are capable of only processing texture or shape-based information for performing detection tasks. However, extraction of only texture- or shape-based features does not provide the model robustness needed to classify different types of retinal diseases. Therefore, concerning these two features, this paper developed a fusion model called ‘Conv-ViT’ to detect retinal diseases from foveal cut optical coherence tomography (OCT) images. The transfer learning-based CNN models, such as Inception-V3 and ResNet-50, are utilized to process texture information by calculating the correlation of the nearby pixel. Additionally, the vision transformer model is fused to process shape-based features by determining the correlation between long-distance pixels. The hybridization of these three models results in shape-based texture feature learning during the classification of retinal diseases into its four classes, including choroidal neovascularization (CNV), diabetic macular edema (DME), DRUSEN, and NORMAL. The weighted average classification accuracy, precision, recall, and F1 score of the model are found to be approximately 94%. The results indicate that the fusion of both texture and shape features assisted the proposed Conv-ViT model to outperform the state-of-the-art retinal disease classification models.

“…This evolution can be traced back to AlexNet [17], which introduced a fundamental CNN architecture that achieved groundbreaking results on challenging datasets. Inspired by AlexNet, researchers began applying convolutional neural networks to various deep learning tasks, establishing them as one of the prevailing approaches in contemporary research [18][19][20][21][22]. Initially, the approach involved using convolutional neural networks in a sliding window model [23]. However, due to computational complexity limitations, the sliding window approach gradually gave way to the region proposal method [24].…”

Section: Related Work (mentioning)

confidence: 99%

EB-YOLOX: A Balanced One-Stage Object Detection Model Concentrating on Global Features

Wang, Liu, et al. 2023

Preprint

With the increasing complexity of objects and the limitation of hardware resources, it is crucial to design object detection models that are both highly effective and efficient. Although YOLOX, as a leading object detector, strikes a good balance between the number of parameters and performance, it still has some design weaknesses. These include constraints in Bottleneck extracted features, insufficient feature fusion and information loss in FPN (Feature Pyramid Networks), and an uneven trade-off between performance and efficiency in the detection head. These limitations affect the equilibrium between the number of parameters and model performance. To address these issues, this paper presents an object detection model known as EB-YOLOX (Emphasized Balance YOLOX), which incorporates three key elements: Multidirectional Feature Extraction Bottleneck, Decoupled bidirectional Feature Pyramid Network, and Efficient Decoupling head network. With these components, EB-YOLOX effectively tackles model performance challenges and undergoes a comprehensive evaluation on the MS COCO dataset. Experimental results indicate that EB-YOLOX outperforms YOLOX-S by 1.8% in Average Precision (AP) and achieves significant performance improvements across various scenarios, highlighting its excellent balance between the number of parameters and performance.

“…Salient object detection (SOD) aims at detecting the most visually attractive objects from the inputs [1], [2], which has been widely performed on many computer vision tasks, such as tracking [3], segmentation [4], [5], action recognition [6], camouflaged object detection [7], and so on.…”

Section: Introduction (mentioning)

confidence: 99%

Employing Bilinear Fusion and Saliency Prior Information for RGB-D Salient Object Detection

Huang, Yang, Zhang, et al. 2022

IEEE Trans. Multimedia

Multi-modal feature fusion and saliency reasoning are two core sub-tasks of RGB-D salient object detection. However, most existing models employ linear fusion strategies (e.g., concatenation) for multi-modal feature fusion and use a simple coarse-to-fine structure for saliency reasoning. Despite their simpleness, they can neither fully capture the cross-modal complementary information nor exploit the multi-level complementary information among the cross-modal features at different levels. To address these issues, a novel RGB-D salient object detection model is presented, where we pay special attention to the aforementioned two sub-tasks. Concretely, a multi-modal feature interaction module is first presented to explore more interactions between the unimodal RGB and depth features. It helps to capture their cross-modal complementary information by jointly using some simple linear fusion strategies and bilinear fusion ones. Then, a saliency prior information guided fusion module is presented to exploit the multi-level complementary information among the fused cross-modal features at different levels. Instead of employing a simple convolutional layer for the final saliency prediction, a saliency refinement and prediction module is designed to better exploit those extracted multilevel cross-modal information for RGB-D saliency detection. Experimental results on several benchmark datasets verify the effectiveness and superiority of the proposed framework over some state-of-the-art methods.

Index Terms: RGB-D salient object detection, bilinear fusion strategy, saliency prior information guided fusion, saliency refinement and prediction.

[9] and segmentation [10]. Benefiting from the progress of Convolutional Neural Networks (CNNs), CNNs based RGB SOD models [2], [11], [12], [13] have significantly improved the performance of conventional hand-crafted feature based approaches [14], [15], [16], [17]. However, such algorithms are found vulnerable to complex environments, varying illuminations or cluttered backgrounds. After paying a lot of efforts, researchers realize that using RGB images only cannot solve those challenges. Meanwhile,
