2021
DOI: 10.48550/arxiv.2108.06932
Preprint

Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers

Abstract: Most polyp segmentation methods use CNNs as their backbone, leading to two key issues when exchanging information between the encoder and decoder: 1) taking into account the differences in contribution between different-level features; and 2) designing an effective mechanism for fusing these features. Different from existing CNN-based methods, we adopt a transformer encoder, which learns more powerful and robust representations. In addition, considering the image acquisition influence and elusive properties of po…

Cited by 23 publications (67 citation statements)
References 68 publications
“…For example, Vision Transformer (ViT) [1] first showed that a pure transformer can achieve state-of-the-art performance in image classification. The Pyramid Vision Transformer (PVT v1) [3] showed that a pure transformer backbone can also surpass CNN counterparts for dense prediction tasks such as detection and segmentation [9][10][11]. Later, Swin Transformer [5], CoaT [6], LeViT [7], and Twins [8] further improved classification, detection, and segmentation performance with transformer backbones.…”
mentioning
confidence: 99%
“…Nanni et al. [36] proposed encoder-decoder ensemble classifiers that can be used for semantic segmentation and introduced a novel loss function that results from the combination of Dice loss and a structural similarity index (SSIM). Dong et al. [37] presented a pyramid vision transformer backbone as an encoder for the extraction of robust features, with three key components: a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM). The sum of the IoU loss and weighted binary cross-entropy loss is used as the loss function.…”
Section: Related Work
mentioning
confidence: 99%
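
The excerpt above notes that the loss is the sum of an IoU term and a weighted binary cross-entropy term. A minimal PyTorch sketch of such a combined loss is given below; the boundary-aware weighting (the 31x31 average-pooling window and the factor of 5) follows the common PraNet/Polyp-PVT-style formulation and is an assumption here, not a detail quoted from the citing paper.

import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    # pred: raw logits (N, 1, H, W); mask: binary ground truth (N, 1, H, W).
    # Boundary-aware weights: pixels near mask edges get larger weights
    # (kernel size 31 and factor 5 are assumed, as in PraNet-style losses).
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)

    # Weighted binary cross-entropy term.
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    # Weighted IoU term on the sigmoid-activated prediction.
    prob = torch.sigmoid(pred)
    inter = ((prob * mask) * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)

    return (wbce + wiou).mean()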
“…• In [1] and [60], several deep learning segmentation approaches are compared: SegNet, U-Net, DeepLabv3+, HarDNet-MSEG (Harmonic Densely Connected Network) [61], and Polyp-PVT [62], a deep learning segmentation model based on a transformer encoder, i.e. PVT (Pyramid Vision Transformer).…”
Section: Skin Detection Approaches
mentioning
confidence: 99%
“…• number of epochs = 10 (using the simple data augmentation approach DA1, see section 3.3) or 15 (the latter for the more complex data augmentation approach DA2, see section 3. We present an ensemble based on DeepLabV3+, HarDNet-MSEG [61], Polyp-PVT [62], and Hybrid Semantic Network (HSN) [79]. HarDNet-MSEG (Harmonic Densely Connected Network) [61] is a model influenced by densely connected networks that reduces memory consumption by diminishing aggregation, removing most of the connection layers used in the DenseNet layer.…”
Section: Deep Learning For Semantic Image Segmentation
mentioning
confidence: 99%
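
The excerpt above lists the members of the ensemble (DeepLabV3+, HarDNet-MSEG, Polyp-PVT, HSN) but not the fusion rule. As an illustration only, the sketch below averages the per-model sigmoid probability maps and thresholds the mean; the averaging rule, the 0.5 threshold, and the single-logit output shape are assumptions, not details taken from the cited work.

import torch

@torch.no_grad()
def ensemble_predict(models, image, threshold=0.5):
    # Average the sigmoid probability maps of several segmentation models
    # and threshold the mean to obtain a binary mask.
    # Assumes each model maps an image batch to logits of shape (N, 1, H, W).
    probs = []
    for model in models:
        model.eval()
        logits = model(image)
        probs.append(torch.sigmoid(logits))
    mean_prob = torch.stack(probs, dim=0).mean(dim=0)
    return (mean_prob > threshold).float()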