2016
DOI: 10.1007/978-3-319-46475-6_5

Modeling Context in Referring Expressions

Abstract: Humans refer to objects in their environments all the time, especially in dialogue with other people. We explore generating and comprehending natural language referring expressions for objects in images. In particular, we focus on incorporating better measures of visual context into referring expression models and find that visual comparison to other objects within an image helps improve performance significantly. We also develop methods to tie the language generation process together, so that we generate expressions…
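As an illustration of the visual comparison the abstract describes, the sketch below contrasts the referent's CNN feature with the features of other same-category objects in the image and averages the normalized differences into a context vector. This is only an illustrative reading of the idea; the function and variable names are ours, not the authors' code.

```python
# Hypothetical sketch of a "visual comparison" context feature: the target
# object's CNN feature is contrasted with the features of the other
# same-category objects in the image (names are illustrative).
import torch

def visual_difference_feature(target_feat: torch.Tensor,
                              other_feats: torch.Tensor) -> torch.Tensor:
    """Average the normalized differences between the target object's CNN
    feature and the CNN features of other objects of the same category.

    target_feat: (D,) feature of the referent (e.g. from a CNN pooling layer)
    other_feats: (N, D) features of the other same-category objects
    """
    if other_feats.numel() == 0:
        # No comparable objects: fall back to a zero context vector.
        return torch.zeros_like(target_feat)
    diffs = target_feat.unsqueeze(0) - other_feats            # (N, D)
    diffs = diffs / (diffs.norm(dim=1, keepdim=True) + 1e-8)  # scale-invariant
    return diffs.mean(dim=0)                                  # (D,) context vector
```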

Cited by 596 publications (774 citation statements). References 34 publications.
“…To learn the common feature space, they propose different matching loss functions to optimize, e.g., softmax loss [16,21] and triplet loss [25]. Another work [18,30,19] learns to maximize the likelihood of the expression given the referent and the image, and the work inputs the fusion of visual object feature, visual context feature (e.g., entire image CNN feature [18], the visual difference between the objects belonging to the same category in the image [30] and context region CNN features [19]), object location feature and the word embedding to an LSTM to parameterize the distribution. Different from the previous work, recent work [32,4] adopts co-attention mechanisms to build up the interactions between the expression and the objects in the image.…”
Section: Referring Expression Comprehension (mentioning)
Confidence: 99%
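The LSTM likelihood model described in this statement can be sketched roughly as follows. This is a minimal, assumed architecture (dimensions and names are illustrative, not taken from the cited implementations): the object feature, a visual context feature, and a location feature are fused and fed to an LSTM together with the word embeddings, so the network parameterizes P(expression | referent, image).

```python
# Minimal sketch of an LSTM-based likelihood ("speaker") model: a fused visual
# representation conditions an LSTM over word embeddings, giving per-step
# vocabulary logits for P(expression | referent, image). All names and
# dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class ReferringSpeaker(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, visual_dim=4096 * 2 + 5,
                 hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.visual_fc = nn.Linear(visual_dim, hidden_dim)   # fuse visual inputs
        self.lstm = nn.LSTM(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.vocab_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, obj_feat, ctx_feat, loc_feat, words):
        # Concatenate object CNN feature, visual context feature (e.g. whole-image
        # CNN feature or a same-category difference feature) and a 5-d location feature.
        visual = torch.cat([obj_feat, ctx_feat, loc_feat], dim=-1)
        visual = torch.relu(self.visual_fc(visual))           # (B, H)
        emb = self.embed(words)                               # (B, T, E)
        # Tile the visual code and feed it alongside every word embedding.
        visual = visual.unsqueeze(1).expand(-1, emb.size(1), -1)
        out, _ = self.lstm(torch.cat([emb, visual], dim=-1))
        return self.vocab_out(out)                            # per-step logits
```

Training such a speaker would typically maximize the likelihood of ground-truth expressions with a per-word cross-entropy loss.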
“…Those approaches ignore the relationships among objects in the image and the linguistic structure in the expression, which is the key to referring expression comprehension. For an image, they represent the image as a set of independent visual objects [16,21,25,13,18] or compound objects only including direct relationship [19,30]. For an expression, they encode the expression sequentially and ignore the dependencies in the expression.…”
Section: Referring Expression Comprehension (mentioning)
Confidence: 99%
“…A common method is to first extract regions of interest from the image, using a region proposal network (RPN). Yu et al (2016); Mao et al (2016) decode these proposals as a caption using a recurrent neural network (RNN). The predicted region corresponds to the caption that is ranked most similar to the referring expression.…”
Section: Related Work (mentioning)
Confidence: 99%
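A hedged sketch of this comprehension-by-ranking step, assuming a speaker/captioning model with the interface of the sketch above (all names are illustrative, not the cited authors' code): each region proposal is scored by the log-likelihood it assigns to the query expression, and the highest-scoring region is selected.

```python
# Illustrative comprehension-by-generation: score each region proposal by how
# well a speaker model explains the query expression, then return the best one.
import torch
import torch.nn.functional as F

def comprehend(speaker, proposals, expression_ids):
    """proposals: list of (obj_feat, ctx_feat, loc_feat) tuples, one per region.
    expression_ids: (T,) token ids of the referring expression (with BOS/EOS)."""
    words = expression_ids.unsqueeze(0)                       # (1, T)
    scores = []
    for obj_feat, ctx_feat, loc_feat in proposals:
        logits = speaker(obj_feat.unsqueeze(0),
                         ctx_feat.unsqueeze(0),
                         loc_feat.unsqueeze(0),
                         words[:, :-1])                       # predict next tokens
        log_probs = F.log_softmax(logits, dim=-1)
        # Log-likelihood of the expression under this region's caption model.
        ll = log_probs.gather(2, words[:, 1:].unsqueeze(-1)).sum()
        scores.append(ll)
    return int(torch.stack(scores).argmax())                  # index of best region
```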
“…For testing, we use the training section of REFCOCO corpus collected by (Yu et al, 2016), which is based on the MSCOCO collection (Lin et al, 2014) containing over 300k images with object segmentations. This gives us a large enough test set to make stable predictions about the quality of individual word predictors, which often only have a few positive instances in the test set of the REFERIT corpus.…”
Section: Experimental Set-up (mentioning)
Confidence: 99%
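A minimal sketch of this data set-up, assuming the `refer` Python toolkit distributed with the REFCOCO release; the exact method names and arguments may differ between versions, so treat this as an assumption rather than a verified interface.

```python
# Loading the training section of REFCOCO (built on MSCOCO), as described above.
# Assumes the `refer` toolkit that accompanies the REFCOCO data.
from refer import REFER

refer = REFER(data_root='data', dataset='refcoco', splitBy='unc')
train_ref_ids = refer.getRefIds(split='train')   # training section only
print(len(train_ref_ids), 'referring expressions in the training section')

for ref in refer.loadRefs(train_ref_ids[:3]):
    # Each ref links an MSCOCO image/annotation to one or more expressions.
    print(ref['image_id'], [s['sent'] for s in ref['sentences']])
```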