Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d16-1156
Resolving Language and Vision Ambiguities Together: Joint Segmentation and Prepositional Attachment Resolution in Captioned Scenes

Abstract: We present an approach to simultaneously perform semantic segmentation and prepositional phrase attachment resolution for captioned images. Some ambiguities in language cannot be resolved without simultaneously reasoning about an associated image. If we consider the sentence "I shot an elephant in my pajamas", looking at language alone (and not using common sense), it is unclear whether it is the person or the elephant wearing the pajamas, or both. Our approach produces a diverse set of plausible hypotheses for both…
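The abstract describes generating multiple hypotheses for each task and then reasoning over them jointly. As an illustration only, the short Python sketch below re-ranks paired segmentation and attachment hypotheses with a toy consistency term; the data structures, scores, and the consistency and joint_rerank functions are hypothetical and are not the model described in the paper.

# Hypothetical sketch of joint re-ranking over paired hypotheses.
# Scores and the consistency term are illustrative placeholders only.
from itertools import product

def consistency(segmentation, attachment):
    # Toy agreement score: reward a PP attachment when the two objects it
    # relates are adjacent/overlapping regions in the candidate segmentation.
    adjacent = segmentation["adjacent_regions"]  # set of (label, label) pairs
    return sum(1.0 for (head, _prep, child) in attachment["edges"]
               if (head, child) in adjacent or (child, head) in adjacent)

def joint_rerank(seg_hypotheses, attach_hypotheses, alpha=1.0):
    # Pick the (segmentation, attachment) pair maximizing a combined score.
    best, best_score = None, float("-inf")
    for seg, att in product(seg_hypotheses, attach_hypotheses):
        score = seg["score"] + att["score"] + alpha * consistency(seg, att)
        if score > best_score:
            best, best_score = (seg, att), score
    return best, best_score

# Elephant/pajamas example: two attachment readings, one segmentation
# in which only the person's region overlaps the pajamas region.
segs = [{"adjacent_regions": {("person", "pajamas")}, "score": 0.9}]
atts = [
    {"edges": [("person", "in", "pajamas")], "score": 0.4},    # person wears pajamas
    {"edges": [("elephant", "in", "pajamas")], "score": 0.5},  # elephant wears pajamas
]
print(joint_rerank(segs, atts))

In this toy example the language-only scores slightly favor the elephant reading, but the image-side adjacency term flips the decision to the person reading, which mirrors the kind of disambiguation the abstract describes.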

Cited by 15 publications (6 citation statements)
References 28 publications
“…Adding visual information provides grounding for words, potentially allowing the networks to learn references from words to objects, or at least visual features in the input (Hill et al., 2021; Vong & Lake, 2022). Multimodal learning has been shown to help resolve ambiguities when only linguistic information is present (Berzak, Barbu, Harari, Katz, & Ullman, 2015; Christie et al., 2016), induce constituent structures (Shi, Mao, Gimpel, & Livescu, 2019), and ground events described in language to video (Siddharth, Barbu, & Siskind, 2014; Yu, Siddharth, Barbu, & Siskind, 2015).…”
Section: Neural Network and Training (mentioning)
confidence: 99%
“…Visually-aided Language Learning Previous research attempt to introduce visual information to improve language learning on various Natural Language Processing (NLP) scenarios, including but not limited to machine translation (Grubinger et al., 2006; Elliott et al., 2016), information retrieval (Funaki and Nakayama, 2015; Gu et al., 2018), semantic parsing (Christie et al., 2016; Shi et al., 2019), natural language inference (Xie et al., 2019), bilingual lexicon learning (Kiela et al., 2015; Vulic et al., 2016), natural language generation evaluation (Zhu et al., 2021), spatial commonsense reasoning (Liu et al., 2022) and language representation learning (Lazaridou et al., 2015; Collell et al., 2017; Kiela et al., 2018; Bordes et al., 2019; Lu et al., 2019; Li et al., 2019; Luo et al., 2020; Li et al., 2020; Tan and Bansal, 2020; Radford et al., 2021). While most of these studies acquire visual information through retrieval from the web or large-scale image sets, a recent line of studies attempt to generate visual supervision from scratch.…”
Section: Related Work (mentioning)
confidence: 99%