2018
DOI: 10.1007/978-3-030-01252-6_39

Dynamic Multimodal Instance Segmentation Guided by Natural Language Queries

Abstract: We address the problem of segmenting an object given a natural language expression that describes it. Current techniques tackle this task by either (i) directly or recursively merging linguistic and visual information in the channel dimension and then performing convolutions; or by (ii) mapping the expression to a space in which it can be thought of as a filter, whose response is directly related to the presence of the object at a given spatial coordinate in the image, so that a convolution can be applied to l…
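
The two families of techniques named in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the tiny 2x2 feature map, and the choice of a 1x1 filter for strategy (ii) are all assumptions made for illustration. For strategy (i) the sketch covers only the channel-wise merge; the convolutions that would follow are omitted.

```python
# Toy sketch only: names, shapes, and the 1x1 filter are illustrative
# assumptions, not the method described in the paper.

def tile_and_concat(visual, text_vec):
    """Strategy (i): append the expression embedding to every location's
    channels, so later convolutions see both modalities together."""
    # visual is an H x W x C nested list; text_vec has length D.
    return [[pixel + text_vec for pixel in row] for row in visual]


def dynamic_filter_response(visual, text_filter):
    """Strategy (ii): treat the expression embedding as a 1x1 filter.
    The response at each location is the dot product between that
    location's features and the filter."""
    return [
        [sum(f * w for f, w in zip(pixel, text_filter)) for pixel in row]
        for row in visual
    ]


# Hypothetical 2 x 2 feature map with 2 channels per location.
visual = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 0.0]],
]
# Hypothetical expression embedding, reused as the filter in strategy (ii).
text_vec = [1.0, -1.0]

fused = tile_and_concat(visual, text_vec)               # H x W x (C + D)
response = dynamic_filter_response(visual, text_vec)    # H x W response map
```

In this toy setup the response map is highest where the location's features align with the expression embedding, which is the intuition behind describing the expression as a filter.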

Cited by 143 publications (95 citation statements)
References 23 publications
“…We contrast our approach to some closely related works, including two existing one-stage grounding methods [52,44] and some on grounding as segmentation [13,20,25,9].…”
Section: Comparison To Other One-stage Grounding Work (mentioning)
confidence: 99%
“…To better achieve word-to-image interaction, [17] directly combines visual features with each word feature from a language LSTM to recurrently refine segmentation results. Dynamic filter [20] for each word further enhances this interaction. In [22], word attention is incorporated in the image regions to model key-word-aware context.…”
Section: Related Work (mentioning)
confidence: 99%
“…Secondly, some previous works (e.g. [17,20]) process each word in the referring expression and concatenate it with visual features to infer the referred object in a sequential order using a recurrent network. The limitation is that these methods only look at local spatial regions and lack the interaction over long-range spatial regions in global context which is essential for semantic understanding and segmentation.…”
Section: Our Model (mentioning)
confidence: 99%
“…[21] used a convolutional LSTM in addition to the language-only LSTM to facilitate propagation of intermediate segmentation beliefs. [20,26] improved upon [21] by making more architectural improvements.…”
Section: Referring Expressions (mentioning)
confidence: 99%