Linguistic Structure Guided Context Modeling for Referring Image Segmentation

Hui, Tianrui; Liu, Si; Huang, Shaofei; Li, Guanbin; Yu, Sansi; Zhang, Faxi; Han, Jizhong

doi:10.1007/978-3-030-58607-2_4

Cited by 92 publications

(44 citation statements)

References 35 publications

“…To further explicitly align the vision and language modalities in a co-embedding space, Chen et al [3] generate the visual-textual co-embedding map in several recurrent steps. As graph neural network [35,42] presents a new form of mining the relationship between data, Hui et al [15] and Yang et al introduce graph structure models to achieve efficient message passing in RIS. Moreover, some works [13,15] also consider the linguistic roles of each word during multimodal interaction process.…”

Section: Related Work (mentioning)

confidence: 99%

“…As graph neural network [35,42] presents a new form of mining the relationship between data, Hui et al [15] and Yang et al introduce graph structure models to achieve efficient message passing in RIS. Moreover, some works [13,15] also consider the linguistic roles of each word during multimodal interaction process. Words are classified into four categories, and a progressive comprehension process is proposed under the guidance of different type of words in [13].…”

Section: Related Work (mentioning)

confidence: 99%

“…Another reason is that small objects appear more frequently in G-Ref than in UNC+ as shown in 7, therefore, the G-Ref is better able to benefit from our model. In addition, two methods (STEP [3] and LSCM [15]) adopting high-resolution visual features should be especially noticed. STEP [3] iteratively fuses 5 levels of visual features for 25 times and LSCM [15] sequentially aggregates 4 levels of visual features through bottom-to-up and top-to-down style as in [22].…”

Section: Comparisons With State-of-the-art Approaches (mentioning)

confidence: 99%

“…In addition, two methods (STEP [3] and LSCM [15]) adopting high-resolution visual features should be especially noticed. STEP [3] iteratively fuses 5 levels of visual features for 25 times and LSCM [15] sequentially aggregates 4 levels of visual features through bottom-to-up and top-to-down style as in [22]. Instead of repeatedly utilizing high-resolution visual feature, we directly feed high-resolution into proposed AMF block, but still yields 3.68% and 2.03% overall IoU boost against STEP and LSCM+DCRF on G-Ref val set, which indicates the effectiveness of our design.…”

Section: Comparisons With State-of-the-art Approaches (mentioning)

confidence: 99%

Two-stage Visual Cues Enhancement Network for Referring Image Segmentation

Jiao, Jie, Luo, et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia

Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred by one given natural language expression. The diverse and flexible expressions as well as complex visual contents in the images raise the RIS model with higher demands for investigating fine-grained matching behaviors between words in expressions and objects presented in images. However, such matching behaviors are hard to be learned and captured when the visual cues of referents (i.e. referred objects) are insufficient, as the referents with weak visual cues tend to be easily confused by cluttered background at boundary or even overwhelmed by salient objects in the image. And the insufficient visual cues issue can not be handled by the cross-modal fusion mechanisms as done in previous work. In this paper, we tackle this problem from a novel perspective of enhancing the visual information for the referents by devising a Two-stage Visual cues enhancement Network (TV-Net), where a novel Retrieval and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature Fusion (AMF) module are proposed. Specifically, RES retrieves the most relevant image from an external data pool with regard to both the visual and textual similarities, and then enriches the visual information of the referent with the retrieved image for better multimodal feature learning. AMF further enhances the visual detailed information by incorporating the high-resolution feature maps from lower convolution layers of the image. Through the two-stage enhancement, our proposed TV-Net enjoys better performances in learning fine-grained matching behaviors between the natural language expression and image, especially when the visual information of the referent is inadequate, thus produces better segmentation results. Extensive experiments are conducted to validate the effectiveness of the proposed method on the RIS task, with our proposed TV-Net surpassing the state-of-the-art approaches on four benchmark datasets. Our code is available at: https://github.com/SxJyJay/TV-Net.

“…[12] utilize query attention and key-word-aware visual context to model relationships among different image regions, according to the corresponding query. More recent works, [13] model multimodal context by cross-modal interaction and guided through a dependency tree structure, [14] progressively exploits various types of words in the expression to segment the referent in a graph-based structure. In contrast to existing works on RIS that directly refer to objects in an image, we ground the region adjacent to the object to provide navigational guidance to a self-driving vehicle.…”

Section: B Referring Image Segmentation (mentioning)

confidence: 99%

Grounding Linguistic Commands to Navigable Regions

Rufus, Jain, Nair, et al. 2021

Preprint

Humans have a natural ability to effortlessly comprehend linguistic commands such as "park next to the yellow sedan" and instinctively know which region of the road the vehicle should navigate. Extending this ability to autonomous vehicles is the next step towards creating fully autonomous agents that respond and act according to human commands. To this end, we propose the novel task of Referring Navigable Regions (RNR), i.e., grounding regions of interest for navigation based on the linguistic command. RNR is different from Referring Image Segmentation (RIS), which focuses on grounding an object referred to by the natural language expression instead of grounding a navigable region. For example, for a command "park next to the yellow sedan," RIS will aim to segment the referred sedan, and RNR aims to segment the suggested parking region on the road. We introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2car [1] dataset with segmentation masks for the regions described by the linguistic commands. A separate test split with concise manoeuvre-oriented commands is provided to assess the practicality of our dataset. We benchmark the proposed dataset using a novel transformer-based architecture. We present extensive ablations and show superior performance over baselines on multiple evaluation metrics. A downstream path planner generating trajectories based on RNR outputs confirms the efficacy of the proposed framework.