2017
DOI: 10.1016/j.cviu.2017.09.001
Resolving vision and language ambiguities together: Joint segmentation & prepositional attachment resolution in captioned scenes

Cited by 18 publications (14 citation statements)
References 9 publications
“…Our work could be extended in several ways, including by (i) using the knowledge about the bias of spatial relations to evaluate captioning tasks with spatial word substitutions (Shekhar et al, 2017a,b); (ii) examining how functional knowledge is complemented with visual knowledge in language generation (Christie et al, 2016; Delecraz et al, 2017); (iii) using different contextual embeddings such as ELMo (Peters et al, 2018) and BERT (Devlin et al, 2018) for the embedding layer of the generative language model rather than our specifically-trained word embeddings; note that P-vectors are representations of collections of context based on the performance of the decoder language model, while ELMo and BERT are representations of specific context based on the encoder language model; and (iv) comparing language models for spatial descriptions from different pragmatic tasks. As the focus of image captioning is to best describe the image and not, for example, to spatially locate a particular object, the pragmatic context of image descriptions is biased towards the functional sense of spatial relations.…”
Section: Discussion (mentioning; confidence: 99%)
“…A different perspective has been addressed in [21], where the PP-attachment ambiguity of image captions is resolved by leveraging the corresponding image. In particular, the authors propose a joint resolution of both semantic segmentation of the image and prepositional phrase attachment.…”
Section: Related Work (mentioning; confidence: 99%)
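To make the kind of ambiguity concrete, here is a minimal, purely illustrative sketch in Python: the caption, the candidate attachments, and the idea of using segmented regions to break the tie are invented for this example and are not taken from the cited paper.

# Hypothetical example of a PP-attachment ambiguity in a caption.
# The sentence and hypotheses below are invented for illustration only.
caption = "A dog is sitting on a bench with a man"

# Two syntactically valid attachments for the PP "with a man":
attachment_hypotheses = [
    ("with a man", "bench"),    # the PP modifies the noun "bench"
    ("with a man", "sitting"),  # the PP modifies the verb "sitting"
]

# In the joint approach described above, visual evidence (e.g. whether a
# segmented "man" region is adjacent to the "bench" region) is what selects
# between such hypotheses, rather than text-only parser scores.
for pp, head in attachment_hypotheses:
    print(f"'{pp}' attached to '{head}'")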
“…To our knowledge, there are not many works that use multimodal information to deal with this problem. The most relevant work to ours is [4]; their approach consists in simultaneously performing object segmentation and PP-attachment resolution for captioned images. To do that, they produce a set of possible hypotheses for both tasks, and then they jointly rerank them to select the most consistent pair.…”
Section: Related Work (mentioning; confidence: 99%)
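As a rough illustration of the joint-reranking idea summarized in that statement, the sketch below scores every (segmentation, attachment) hypothesis pair with a weighted combination of per-task scores and a cross-modal consistency score, then keeps the best pair. The function names, scoring callbacks, and weights are assumptions made for this sketch, not the cited paper's actual model.

# Minimal sketch of jointly reranking hypothesis pairs from two tasks.
# All scoring functions and weights are placeholders supplied by the caller.
from itertools import product

def rerank(seg_hypotheses, pp_hypotheses,
           seg_score, pp_score, consistency_score,
           alpha=1.0, beta=1.0, gamma=1.0):
    """Return the (segmentation, attachment) pair with the highest combined score."""
    best_pair, best_score = None, float("-inf")
    for seg, pp in product(seg_hypotheses, pp_hypotheses):
        score = (alpha * seg_score(seg)                 # image-only evidence
                 + beta * pp_score(pp)                  # text-only evidence
                 + gamma * consistency_score(seg, pp))  # cross-modal agreement
        if score > best_score:
            best_pair, best_score = (seg, pp), score
    return best_pair, best_score

# Example usage with toy hypotheses and toy scores:
pair, score = rerank(
    ["seg_A", "seg_B"], ["pp_noun", "pp_verb"],
    seg_score=lambda s: 0.9 if s == "seg_A" else 0.4,
    pp_score=lambda p: 0.5,
    consistency_score=lambda s, p: 1.0 if (s, p) == ("seg_A", "pp_noun") else 0.0,
)
print(pair, score)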