2017
DOI: 10.1016/j.cviu.2017.09.001
Resolving vision and language ambiguities together: Joint segmentation & prepositional attachment resolution in captioned scenes

Cited by 18 publications (14 citation statements)
References 9 publications
“…Our work could be extended in several ways, including by (i) using the knowledge about the bias of spatial relations to evaluate captioning tasks with spatial word substitutions (Shekhar et al, 2017a,b); (ii) examining how functional knowledge is complemented with visual knowledge in language generation (Christie et al, 2016; Delecraz et al, 2017); (iii) using different contextual embeddings such as ELMo (Peters et al, 2018) and BERT (Devlin et al, 2018) for the embedding layer of the generative language model rather than our specifically-trained word embeddings; note that P-vectors are representations of collections of context based on the performance of the decoder language model, while ELMo and BERT are representations of specific context based on the encoder language model; and (iv) comparing language models for spatial descriptions from different pragmatic tasks. As the focus of image captioning is to best describe the image and not, for example, to spatially locate a particular object, the pragmatic context of image descriptions is biased towards the functional sense of spatial relations.…”
Section: Discussion (mentioning; confidence: 99%)
“…A different perspective has been addressed in [21], where the PP-attachment ambiguity of image captions is resolved by leveraging the corresponding image. In particular, the authors propose a joint resolution of both semantic segmentation of the image and prepositional phrase attachment.…”
Section: Related Work (mentioning; confidence: 99%)
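To make the kind of ambiguity concrete, here is a minimal, purely illustrative sketch in Python: the caption, the candidate attachments, and the idea of using segmented regions to break the tie are invented for this example and are not taken from the cited paper.

# Hypothetical example of a PP-attachment ambiguity in a caption.
# The sentence and hypotheses below are invented for illustration only.
caption = "A dog is sitting on a bench with a man"

# Two syntactically valid attachments for the PP "with a man":
attachment_hypotheses = [
    ("with a man", "bench"),    # the PP modifies the noun "bench"
    ("with a man", "sitting"),  # the PP modifies the verb "sitting"
]

# In the joint approach described above, visual evidence (e.g. whether a
# segmented "man" region is adjacent to the "bench" region) is what selects
# between such hypotheses, rather than text-only parser scores.
for pp, head in attachment_hypotheses:
    print(f"'{pp}' attached to '{head}'")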
“…To our knowledge, there are not many works that use multimodal information to deal with this problem. The most relevant work to ours is [4]; their approach consists in simultaneously performing object segmentation and PP-attachment resolution for captioned images. To do that, they produce a set of possible hypotheses for both tasks, and then they jointly rerank them to select the most consistent pair.…”
Section: Related Work (mentioning; confidence: 99%)
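As a rough illustration of the joint-reranking idea summarized in that statement, the sketch below scores every (segmentation, attachment) hypothesis pair with a weighted combination of per-task scores and a cross-modal consistency score, then keeps the best pair. The function names, scoring callbacks, and weights are assumptions made for this sketch, not the cited paper's actual model.

# Minimal sketch of jointly reranking hypothesis pairs from two tasks.
# All scoring functions and weights are placeholders supplied by the caller.
from itertools import product

def rerank(seg_hypotheses, pp_hypotheses,
           seg_score, pp_score, consistency_score,
           alpha=1.0, beta=1.0, gamma=1.0):
    """Return the (segmentation, attachment) pair with the highest combined score."""
    best_pair, best_score = None, float("-inf")
    for seg, pp in product(seg_hypotheses, pp_hypotheses):
        score = (alpha * seg_score(seg)                 # image-only evidence
                 + beta * pp_score(pp)                  # text-only evidence
                 + gamma * consistency_score(seg, pp))  # cross-modal agreement
        if score > best_score:
            best_pair, best_score = (seg, pp), score
    return best_pair, best_score

# Example usage with toy hypotheses and toy scores:
pair, score = rerank(
    ["seg_A", "seg_B"], ["pp_noun", "pp_verb"],
    seg_score=lambda s: 0.9 if s == "seg_A" else 0.4,
    pp_score=lambda p: 0.5,
    consistency_score=lambda s, p: 1.0 if (s, p) == ("seg_A", "pp_noun") else 0.0,
)
print(pair, score)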