Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018
DOI: 10.18653/v1/p18-1085

Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search

Abstract: We introduce Picturebook, a large-scale lookup operation to ground language via 'snapshots' of our physical world accessed through image search. For each word in a vocabulary, we extract the top-k images from Google image search and feed the images through a convolutional network to extract a word embedding. We introduce a multimodal gating function to fuse our Picturebook embeddings with other word representations. We also introduce Inverse Picturebook, a mechanism to map a Picturebook embedding back into wor…
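
As a rough illustration of the lookup described in the abstract, the sketch below builds a grounded word embedding by encoding the top-k retrieved images and combining their features. The `search_top_k_images` and `image_encoder` callables are hypothetical placeholders, and the concatenation step is an assumption; the paper's own retrieval pipeline and convolutional network are not reproduced here.

```python
import numpy as np

def picturebook_embedding(word, search_top_k_images, image_encoder, k=10):
    """Sketch of a Picturebook-style lookup: fetch the top-k images for a
    word, encode each one with a convolutional network, and combine the
    per-image features into a single grounded word embedding."""
    images = search_top_k_images(word, k=k)            # hypothetical image-search helper
    features = [image_encoder(img) for img in images]  # one fixed-size feature vector per image
    return np.concatenate(features, axis=-1)           # k * feature_dim grounded embedding
```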

Cited by 43 publications (43 citation statements); References 57 publications.
“…Yang et al. (2017) introduced a gating method for choosing between word and character embeddings, i.e., a method for word embedding selection. Gating has also been widely applied to multimodal fusion (Arevalo et al., 2017; Wang et al., 2018b; Kiros et al., 2018). Our work is also related to recent methods that induce contextualized word representations (McCann et al., 2017; Peters et al., 2018) as well as pre-training language models for task-dependent fine-tuning (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018).…”
Section: Related Work
confidence: 83%
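
The citation above points to the multimodal gating that Kiros et al. (2018) use to fuse Picturebook embeddings with textual word vectors. The module below is a minimal, generic gated-fusion sketch in PyTorch: a sigmoid gate computed from both inputs takes an element-wise convex combination of their projections. The exact parameterization in the paper may differ; this is an illustrative variant, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Generic gated fusion of a textual embedding and a visually grounded
    (Picturebook-style) embedding."""
    def __init__(self, text_dim, image_dim, out_dim):
        super().__init__()
        self.proj_text = nn.Linear(text_dim, out_dim)
        self.proj_image = nn.Linear(image_dim, out_dim)
        self.gate = nn.Linear(text_dim + image_dim, out_dim)

    def forward(self, text_emb, image_emb):
        # Gate decides, per dimension, how much to trust each modality.
        g = torch.sigmoid(self.gate(torch.cat([text_emb, image_emb], dim=-1)))
        return g * self.proj_text(text_emb) + (1 - g) * self.proj_image(image_emb)
```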
“…We also experimented with additional embedding types, including Picturebook (Kiros et al., 2018), knowledge-graph-based, and neural-machine-translation-based embeddings. While adding these embeddings improved performance on NLI, they did not lead to any performance gains on downstream tasks.…”
Section: Limitations
confidence: 99%
“…We also adapted our approach to a visual dialogue task and achieved excellent performance. A possible improvement to our work is adding pre-trained embeddings such as BERT (Devlin et al., 2018) or image-grounded word embeddings (Kiros et al., 2018) to improve the semantic understanding capability of the models. (The dialogue history length was set to 10 due to memory issues with large input sequences.)…”
Section: Results
confidence: 99%
“…In EXP 2.1, we compare our model with the Inception V3 network (Ioffe and Szegedy, 2015) for the visual stimuli, and in EXP 2.2 with SoundNet (Aytar et al., 2016) for the auditory stimuli. These two models present competitive results on different audio-visual recognition tasks (Jansen et al., 2018; Jiang et al., 2018; Kiros et al., 2018; Kumar et al., 2018). For all experiments, we trained the models 10 times and determined the mean accuracy and standard deviation for each modality.…”
Section: Methods
confidence: 99%
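
The quoted protocol reports the mean accuracy and standard deviation over 10 training runs per modality. A trivial sketch of that aggregation follows; the accuracies are randomly generated placeholders, not the cited paper's results.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder accuracies standing in for 10 training runs per modality;
# the values are random and only illustrate the mean/std reporting.
runs = {
    "visual": rng.uniform(0.6, 0.8, size=10),
    "auditory": rng.uniform(0.5, 0.7, size=10),
}
for modality, accs in runs.items():
    print(f"{modality}: mean accuracy = {accs.mean():.3f}, std = {accs.std():.3f}")
```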