Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2015
DOI: 10.3115/v1/p15-2019

Learning language through pictures

Abstract: We propose IMAGINET, a model of learning visually grounded representations of language from coupled textual and visual input. The model consists of two Gated Recurrent Unit networks with shared word embeddings, and uses a multi-task objective by receiving a textual description of a scene and trying to concurrently predict its visual representation and the next word in the sentence. Mimicking an important aspect of human language learning, it acquires meaning representations for individual words from descriptions of visual scenes.
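The abstract describes the architecture only at a high level. Below is a minimal sketch of such a two-pathway, multi-task setup, assuming PyTorch; the class and parameter names (SharedGRUCaptioner, visual_dim, alpha) are illustrative rather than taken from the paper's released code, and mean-squared error stands in for whatever visual loss the authors actually used.

```python
# A minimal sketch of an IMAGINET-style multi-task model (assumed PyTorch).
# Two GRU pathways share one word-embedding table; one pathway predicts the
# next word, the other predicts the image feature vector of the scene.
import torch
import torch.nn as nn

class SharedGRUCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, visual_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)                  # shared word embeddings
        self.gru_lang = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # language-model pathway
        self.gru_vis = nn.GRU(embed_dim, hidden_dim, batch_first=True)    # visual pathway
        self.next_word = nn.Linear(hidden_dim, vocab_size)                # next-word prediction head
        self.to_visual = nn.Linear(hidden_dim, visual_dim)                # image-feature prediction head

    def forward(self, tokens):
        emb = self.embed(tokens)                             # (batch, seq, embed_dim)
        lang_states, _ = self.gru_lang(emb)                  # per-step states for next-word prediction
        _, vis_final = self.gru_vis(emb)                     # final state summarizes the sentence
        word_logits = self.next_word(lang_states)            # (batch, seq, vocab)
        visual_pred = self.to_visual(vis_final.squeeze(0))   # (batch, visual_dim)
        return word_logits, visual_pred

def multitask_loss(word_logits, next_words, visual_pred, image_feats, alpha=0.5):
    """Weighted sum of next-word cross-entropy and visual-prediction error (assumed weighting)."""
    lm_loss = nn.functional.cross_entropy(
        word_logits.reshape(-1, word_logits.size(-1)), next_words.reshape(-1))
    vis_loss = nn.functional.mse_loss(visual_pred, image_feats)
    return alpha * lm_loss + (1 - alpha) * vis_loss
```

At test time only the textual pathway is needed; the image features act purely as a training signal, which is why later work (quoted below) treats them as privileged information.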

Cited by 52 publications (58 citation statements)
References 34 publications
“…This surprising result is largely due to the fact that the translators did not see the images while providing ground truth translations. More importantly, the effectiveness of visual information in machine translation in a privileged setting is also intuitive following the results of [5]. Chrupala et al [5] show that when image information is used as privileged information in the learning of word representations, the quality of such representations increases.…”
Section: Image Classification With Privileged Localization (mentioning)
confidence: 84%
“…Learning Language under Privileged Visual Information: Using images as privileged information to learn language is not new. Chrupała et al [5] used a multi-task loss while learning word embeddings under privileged visual information. The embeddings are trained for the task of predicting the next word, as well as the representation of the image.…”
Section: Related Work (mentioning)
confidence: 99%
“…Roy and Pentland, 2002; Yu and Ballard, 2004; Lazaridou et al., 2016). Chrupała et al. (2015) introduce a model that learns to predict the visual context from image captions. The model is trained on image-caption pairs from MSCOCO (Lin et al., 2014), capturing both rich visual input as well as larger scale input, but the language input still consists of word symbols.…”
Section: Related Work (mentioning)
confidence: 99%
“…The challenge with textual data is the discrete nature of the input: they use the Gumbel softmax trick (Jang et al., 2017) to generate word sequences which maximize activations for particular neurons. They apply this method to the Imaginet architecture of Chrupała et al. (2015) and confirm one of the findings in Kádár et al. (2017): that the language-model part of the Imaginet architecture is more sensitive to function words than the visual part, which tends to ignore them. They also carry out a separate quantitative evaluation of the synthetic patterns vs. corpus-attested ones in terms of achieved maximum activation.…”
Section: Saliency In Recurrent Network (mentioning)
confidence: 55%
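The Gumbel-softmax trick referred to in the statement above can be sketched as follows, again assuming PyTorch. The helper maximize_activation and the callback score_fn are hypothetical stand-ins for the citing paper's actual objective over Imaginet's hidden units; only the reparameterized sampling itself follows Jang et al. (2017).

```python
# A rough sketch of optimizing a discrete word sequence via the Gumbel-softmax
# relaxation, so that gradients can flow through an otherwise discrete choice.
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, temperature=1.0):
    """Draw a differentiable, approximately one-hot sample over the vocabulary."""
    gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel_noise) / temperature, dim=-1)

def maximize_activation(score_fn, vocab_size, seq_len=5, steps=200, lr=0.1):
    """Optimize per-position logits so the sampled sequence maximizes score_fn.

    score_fn (hypothetical) maps a (seq_len, vocab_size) matrix of soft one-hot
    vectors to a scalar, e.g. the activation of one hidden unit of a trained RNN.
    """
    logits = torch.zeros(seq_len, vocab_size, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        soft_words = gumbel_softmax_sample(logits, temperature=0.5)
        loss = -score_fn(soft_words)       # gradient ascent on the target activation
        loss.backward()
        opt.step()
    return logits.argmax(dim=-1)           # hard word indices after optimization
```

The returned indices form the synthetic pattern whose maximum activation can then be compared against corpus-attested inputs, as the quoted evaluation does.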