Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018
DOI: 10.18653/v1/n18-1038

Learning Visually Grounded Sentence Representations

Abstract: We investigate grounded sentence representations, where we train a sentence encoder to predict the image features of a given caption, i.e., we try to "imagine" how a sentence would be depicted visually, and use the resultant features as sentence representations. We examine the quality of the learned representations on a variety of standard sentence representation quality benchmarks, showing improved performance for grounded models over non-grounded ones. In addition, we thoroughly analyze the extent to which grou…
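The grounding objective described in the abstract, training a sentence encoder to predict the image features of its caption, can be sketched roughly as follows. The GRU encoder, feature dimensions, and cosine regression loss are illustrative assumptions, not the authors' exact architecture or training setup.

```python
import torch
import torch.nn as nn

class GroundedSentenceEncoder(nn.Module):
    """Encode a caption and regress toward its image's feature vector (sketch)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024, image_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, image_dim)  # map into image-feature space

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded caption
        _, h = self.gru(self.embed(token_ids))   # h: (1, batch, hidden_dim)
        sentence_repr = h.squeeze(0)             # grounded sentence representation
        return sentence_repr, self.project(sentence_repr)

# Toy training step: cosine loss between predicted and target image features.
encoder = GroundedSentenceEncoder(vocab_size=10_000)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
captions = torch.randint(1, 10_000, (32, 20))    # stand-in caption batch
image_feats = torch.randn(32, 2048)              # stand-in CNN image features
_, predicted = encoder(captions)
loss = 1 - nn.functional.cosine_similarity(predicted, image_feats).mean()
loss.backward()
optimizer.step()
```

After training, the recurrent state (here `sentence_repr`) rather than the projected image prediction would typically be used as the transferable sentence representation.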

Cited by 50 publications (56 citation statements)
References 51 publications (50 reference statements)
“…These text-based embeddings are trained to encode word-level semantic knowledge and have become a mainstay in work on sentence representations (e.g. [6,7]). When we want to learn language directly from speech, we will have to do so in a more end-to-end fashion, without prior lexical-level knowledge in terms of both form and semantics.…”
Section: Introduction (mentioning)
confidence: 99%
“…In previous work [8] we used image-caption retrieval, where given a written caption the model must return the matching image and vice versa. We trained deep neural networks (DNNs) to create sentence embeddings without the use of prior knowledge of lexical semantics (see [7,9,10] for other studies on this task). The visually grounded sentence embeddings that arose capture semantic information about the sentence as measured by the Semantic Textual Similarity task (see [11]), performing comparably to text-only methods that require word embeddings.…”
Section: Introduction (mentioning)
confidence: 99%
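The image-caption retrieval setup mentioned in the statement above is commonly trained with a bidirectional max-margin (triplet) loss over a batch of paired embeddings. The sketch below assumes both modalities have already been projected into a shared, L2-normalized space; it is illustrative rather than the cited model's exact objective.

```python
import torch

def contrastive_retrieval_loss(caption_emb, image_emb, margin=0.2):
    """Bidirectional max-margin loss for image-caption retrieval (sketch).

    caption_emb, image_emb: (batch, dim), assumed L2-normalized so the dot
    product is cosine similarity; matching pairs share the same row index.
    """
    scores = caption_emb @ image_emb.t()          # (batch, batch) similarity matrix
    positives = scores.diag().unsqueeze(1)        # scores of matching caption-image pairs
    # Penalize negatives that come within `margin` of the positive score,
    # in both retrieval directions (caption->image and image->caption).
    cost_caption = (margin + scores - positives).clamp(min=0)
    cost_image = (margin + scores - positives.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_caption = cost_caption.masked_fill(mask, 0)
    cost_image = cost_image.masked_fill(mask, 0)
    return cost_caption.mean() + cost_image.mean()

# Example with random embeddings standing in for encoder outputs.
captions = torch.nn.functional.normalize(torch.randn(32, 1024), dim=1)
images = torch.nn.functional.normalize(torch.randn(32, 1024), dim=1)
print(contrastive_retrieval_loss(captions, images))
```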
“…Recently, there have also been successful sentence encoder models which are trained on a supervised task and then transferred to other tasks (e.g. [10,11,12]).…”
Section: Introduction (mentioning)
confidence: 99%
“…So far, existing sentence embedding methods often require (pretrained) word embeddings [10,12], large amounts of data [8], or both [13,11]. While word embeddings are successful at enhancing sentence embeddings, they are not very plausible as a model of human language learning.…”
Section: Introduction (mentioning)
confidence: 99%