Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018
DOI: 10.18653/v1/n18-1038

Learning Visually Grounded Sentence Representations

Abstract: We investigate grounded sentence representations, where we train a sentence encoder to predict the image features of a given caption, i.e., we try to "imagine" how a sentence would be depicted visually, and use the resultant features as sentence representations. We examine the quality of the learned representations on a variety of standard sentence representation quality benchmarks, showing improved performance for grounded models over non-grounded ones. In addition, we thoroughly analyze the extent to which grou…
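The grounding objective described in the abstract, training a sentence encoder to predict the image features of its caption, can be sketched roughly as follows. The GRU encoder, feature dimensions, and cosine regression loss are illustrative assumptions, not the authors' exact architecture or training setup.

```python
import torch
import torch.nn as nn

class GroundedSentenceEncoder(nn.Module):
    """Encode a caption and regress toward its image's feature vector (sketch)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024, image_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, image_dim)  # map into image-feature space

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded caption
        _, h = self.gru(self.embed(token_ids))   # h: (1, batch, hidden_dim)
        sentence_repr = h.squeeze(0)             # grounded sentence representation
        return sentence_repr, self.project(sentence_repr)

# Toy training step: cosine loss between predicted and target image features.
encoder = GroundedSentenceEncoder(vocab_size=10_000)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
captions = torch.randint(1, 10_000, (32, 20))    # stand-in caption batch
image_feats = torch.randn(32, 2048)              # stand-in CNN image features
_, predicted = encoder(captions)
loss = 1 - nn.functional.cosine_similarity(predicted, image_feats).mean()
loss.backward()
optimizer.step()
```

After training, the recurrent state (here `sentence_repr`) rather than the projected image prediction would typically be used as the transferable sentence representation.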

Cited by 50 publications (56 citation statements)
References 51 publications (50 reference statements)
“…These text-based embeddings are trained to encode word-level semantic knowledge and have become a mainstay in work on sentence representations (e.g. [6,7]). When we want to learn language directly from speech, we will have to do so in a more end-to-end fashion, without prior lexical-level knowledge in terms of both form and semantics.…”
Section: Introduction (mentioning)
confidence: 99%
“…In previous work [8] we used image-caption retrieval, where given a written caption the model must return the matching image and vice versa. We trained deep neural networks (DNNs) to create sentence embeddings without the use of prior knowledge of lexical semantics (see [7,9,10] for other studies on this task). The visually grounded sentence embeddings that arose capture semantic information about the sentence as measured by the Semantic Textual Similarity task (see [11]), performing comparably to text-only methods that require word embeddings.…”
Section: Introduction (mentioning)
confidence: 99%
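The image-caption retrieval setup mentioned in the statement above is commonly trained with a bidirectional max-margin (triplet) loss over a batch of paired embeddings. The sketch below assumes both modalities have already been projected into a shared, L2-normalized space; it is illustrative rather than the cited model's exact objective.

```python
import torch

def contrastive_retrieval_loss(caption_emb, image_emb, margin=0.2):
    """Bidirectional max-margin loss for image-caption retrieval (sketch).

    caption_emb, image_emb: (batch, dim), assumed L2-normalized so the dot
    product is cosine similarity; matching pairs share the same row index.
    """
    scores = caption_emb @ image_emb.t()          # (batch, batch) similarity matrix
    positives = scores.diag().unsqueeze(1)        # scores of matching caption-image pairs
    # Penalize negatives that come within `margin` of the positive score,
    # in both retrieval directions (caption->image and image->caption).
    cost_caption = (margin + scores - positives).clamp(min=0)
    cost_image = (margin + scores - positives.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_caption = cost_caption.masked_fill(mask, 0)
    cost_image = cost_image.masked_fill(mask, 0)
    return cost_caption.mean() + cost_image.mean()

# Example with random embeddings standing in for encoder outputs.
captions = torch.nn.functional.normalize(torch.randn(32, 1024), dim=1)
images = torch.nn.functional.normalize(torch.randn(32, 1024), dim=1)
print(contrastive_retrieval_loss(captions, images))
```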
“…Recently, there have also been successful sentence encoder models which are trained on a supervised task and then transferred to other tasks (e.g. [10,11,12]).…”
Section: Introduction (mentioning)
confidence: 99%
“…So far, existing sentence embedding methods often require (pretrained) word embeddings [10,12], large amounts of data [8], or both [13,11]. While word embeddings are successful at enhancing sentence embeddings, they are not very plausible as a model of human language learning.…”
Section: Introduction (mentioning)
confidence: 99%