2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.01075
Cross-Modal Self-Attention Network for Referring Image Segmentation

Abstract: We consider the problem of referring image segmentation. Given an input image and a natural language expression, the goal is to segment the object referred by the language expression in the image. Existing works in this area treat the language expression and the input image separately in their representations. They do not sufficiently capture long-range correlations between these two modalities. In this paper, we propose a cross-modal self-attention (CMSA) module that effectively captures the long-range depend…

Cited by 340 publications (242 citation statements)
References 24 publications
“…At first, Transformers showed great performance in NLP tasks. Then, thanks to this excellent performance, Transformers were applied to computer vision tasks such as video processing [69], image super-resolution [15], object detection [13] and segmentation [70], and image classification [71].…”
Section: Methods Based On Transformers
confidence: 99%
“…The main idea of self-attention is to help convolutions capture long-range interactions across the whole image domain. A network equipped with a self-attention module can, at each position, relate fine details to relevant details in distant regions of the image [20][21][22].…”
Section: Self-Attention (SA) Module
confidence: 99%
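The long-range interaction described in the quoted statement can be illustrated with a minimal sketch of self-attention over a flattened feature map. This is an assumption-laden toy (shared query/key/value, no learned projections), not the paper's CMSA module:

```python
import numpy as np

def self_attention(x):
    """Toy self-attention: x is an (N, C) feature map flattened to
    N = H*W spatial positions. Learned projections are omitted, so
    queries, keys, and values are all x itself."""
    q, k, v = x, x, x
    # (N, N) affinity map: every position attends to every other position
    scores = q @ k.T / np.sqrt(x.shape[1])
    # Numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Each output position is a convex combination of ALL positions,
    # which is how long-range dependencies enter the representation.
    return weights @ v

feats = np.random.rand(16, 8)   # e.g. a 4x4 spatial map with 8 channels
out = self_attention(feats)
print(out.shape)                # (16, 8)
```

Because every row of the attention weights sums to 1, each output feature is a weighted average over all spatial positions, regardless of distance.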
“…Multi-scale context modeling has verified its effectiveness in boosting segmentation accuracy in semantic segmentation [11,12,13,14]. Recent works have also shown that the performance of RIS can be further improved by aggregating long-range context from concatenated visual and linguistic features [15] with self-attention [16], or by collecting multi-scale context from fused multi-modal features [10,17] with atrous spatial pyramid pooling (ASPP) [11,18]. However, the former incurs a high memory cost for computing the affinity map and may introduce redundant features that make it harder to distinguish the referred object.…”
Section: Introduction
confidence: 96%
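The memory cost mentioned in the last statement comes from the affinity map itself: attention over N = H×W positions materializes an N×N matrix, so memory grows quadratically with spatial resolution. A rough back-of-envelope helper (hypothetical, assuming float32 entries):

```python
def affinity_map_bytes(h, w, dtype_bytes=4):
    """Memory needed for a dense (N x N) self-attention affinity map
    over an h x w feature map, assuming dtype_bytes per entry."""
    n = h * w
    return n * n * dtype_bytes

# A 64x64 feature map already needs a 4096x4096 affinity map:
print(affinity_map_bytes(64, 64) / 2**20)    # 64.0 MiB
# Doubling the spatial resolution multiplies the cost by 16:
print(affinity_map_bytes(128, 128) / 2**20)  # 1024.0 MiB
```

This quadratic growth is why multi-scale pooling schemes such as ASPP, which avoid the dense pairwise map, are attractive on large inputs.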