2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018)
DOI: 10.1109/cvpr.2018.00602
Referring Image Segmentation via Recurrent Refinement Networks

Cited by 154 publications (137 citation statements)
References 16 publications
“…In contrast, our model uses a cross-modal self-attention module that can effectively model long-range dependencies between linguistic and visual modalities. Lastly, different from [15], which adopts ConvLSTM to refine segmentation with multi-scale visual features sequentially, the proposed method employs a novel gated fusion module for combining multi-level self-attentive features.…”
Section: Our Model (mentioning)
confidence: 99%
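The gated fusion of multi-level features mentioned in this statement can be illustrated with a short sketch. The snippet below is a minimal PyTorch illustration, assuming a per-level 1x1 convolution that produces a sigmoid gate before the levels are summed; the class name, gate design, and tensor shapes are assumptions for illustration, not the exact module from the citing paper.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # Illustrative gated fusion of multi-level feature maps (hypothetical design).
    def __init__(self, channels: int, num_levels: int):
        super().__init__()
        # One 1x1 convolution per level produces a per-pixel, per-channel gate.
        self.gates = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in range(num_levels)]
        )

    def forward(self, features):
        # features: list of tensors, each of shape (B, C, H, W).
        fused = 0
        for feat, gate_conv in zip(features, self.gates):
            gate = torch.sigmoid(gate_conv(feat))  # gate in [0, 1]
            fused = fused + gate * feat            # gated contribution of this level
        return fused

# Example: fuse three levels of 256-channel, 28x28 self-attentive features.
fusion = GatedFusion(channels=256, num_levels=3)
feats = [torch.randn(1, 256, 28, 28) for _ in range(3)]
out = fusion(feats)  # shape (1, 256, 28, 28)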
“…For the language description with N words, we encode each word w_n as a one-hot vector, and project it into a compact word embedding represented as e_n ∈ R^{C_l} by a lookup table. Different from previous methods [10,15,22] that apply LSTM to process the word vectors sequentially and encode the entire language description as a sentence vector, we keep the individual word vectors and introduce a cross-modal self-attention module to capture long-range correlations between these words and spatial regions in the image. More details will be presented in Sec.…”
Section: Multimodal Features (mentioning)
confidence: 99%
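The embedding lookup and cross-modal attention described in this statement can also be sketched. The following PyTorch snippet is a minimal illustration, assuming flattened spatial features and scaled dot-product attention from image regions to individual word vectors; the layer names, dimensions, and attention form are assumptions and do not reproduce the citing paper's exact module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    # Hypothetical sketch: every spatial location attends over all word embeddings.
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # lookup table: one-hot -> compact embedding e_n
        self.query = nn.Linear(dim, dim)            # projects visual features
        self.key = nn.Linear(dim, dim)              # projects word embeddings
        self.value = nn.Linear(dim, dim)

    def forward(self, visual, word_ids):
        # visual: (B, H*W, dim) flattened spatial features; word_ids: (B, N) token indices.
        words = self.embed(word_ids)                 # (B, N, dim), one vector per word
        q = self.query(visual)                       # (B, H*W, dim)
        k = self.key(words)                          # (B, N, dim)
        v = self.value(words)                        # (B, N, dim)
        attn = F.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)  # (B, H*W, N)
        return attn @ v                              # language context for each region

# Example: a 10-word description attended by a 28x28 feature map.
module = CrossModalAttention(vocab_size=1000, dim=256)
visual = torch.randn(2, 28 * 28, 256)
word_ids = torch.randint(0, 1000, (2, 10))
context = module(visual, word_ids)  # shape (2, 784, 256)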