2019
DOI: 10.1017/s1351324919000196

Learning semantic sentence representations from visually grounded language without lexical knowledge

Abstract: Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep Neural Networks are trained to map the two modalities to a common embedding space such that for an image the correspondin…

Cited by 12 publications (27 citation statements)
References 44 publications

“…We take the cosine similarity cos(x, y) and subtract the similarity of the mismatched pairs from the matching pairs such that the loss is only zero when the matching pair is more similar than the mismatched pairs by a margin α. We use importance sampling to select the mismatched pairs; rather than using all the other samples in the mini-batch as mismatched pairs (as done in [8,15]), we calculate the loss using only the hardest examples (i.e. mismatched pairs with high cosine similarity).…”
Section: Training
confidence: 99%
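The quoted passage describes a margin-based hinge loss over cosine similarities with hard-negative mining inside the mini-batch. A minimal PyTorch sketch of that idea follows; the function and argument names (hinge_loss_hard_negatives, img_emb, cap_emb, margin) are illustrative assumptions, not the authors' published code.

```python
import torch
import torch.nn.functional as F

def hinge_loss_hard_negatives(img_emb, cap_emb, margin=0.2):
    """Sketch of a triplet hinge loss that uses only the hardest
    mismatched pair per anchor in the mini-batch (hypothetical names)."""
    # Normalise so that dot products equal cosine similarities.
    img_emb = F.normalize(img_emb, dim=1)
    cap_emb = F.normalize(cap_emb, dim=1)

    # sims[i, j] = cos(image_i, caption_j); the diagonal holds matching pairs.
    sims = img_emb @ cap_emb.t()
    pos = sims.diag().view(-1, 1)

    # Mask out the matching pairs before selecting hard negatives.
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    neg = sims.masked_fill(mask, float('-inf'))

    # Hardest mismatched caption per image, and hardest image per caption.
    hardest_cap = neg.max(dim=1).values.view(-1, 1)
    hardest_img = neg.max(dim=0).values.view(-1, 1)

    # Zero only when each matching pair beats its hardest mismatched
    # pair by at least the margin alpha.
    loss_i2c = F.relu(margin - pos + hardest_cap)
    loss_c2i = F.relu(margin - pos + hardest_img)
    return (loss_i2c + loss_c2i).mean()
```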
“…Another evaluation method for sentence-level semantics is to compare learned sentence similarities to human similarity judgments (e.g. Merkx and Frank, 2019).…”
Section: Related Work and Novelty
confidence: 99%
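The evaluation this passage refers to scores a model by correlating its sentence similarities with human similarity judgments, typically via a rank correlation such as Spearman's rho as in the STS benchmarks. A minimal sketch under that assumption, where the encode callable is a hypothetical stand-in for any sentence encoder:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(encode, sentence_pairs, human_scores):
    """encode: callable mapping a sentence string to a 1-D embedding.
    sentence_pairs: list of (s1, s2); human_scores: matching gold ratings."""
    model_scores = []
    for s1, s2 in sentence_pairs:
        a, b = encode(s1), encode(s2)
        # Cosine similarity between the two sentence embeddings.
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        model_scores.append(cos)
    # Spearman rank correlation against the human judgments.
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```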
“…The papers in this special issue investigate three main aspects that are under debate in the current research: (a) the capability of end-to-end trained neural networks (NNs) to learn sentence representations with no a priori assumption as to the existence or specifics of the interface between syntax and semantics in natural languages; (b) the need to combine NNs with formal structures defined a priori, following some theoretical assumptions on language syntax and its interplay with semantics; and (c) the importance of developing explainable models that are transparent in their findings and whose decisions are traceable. Maillard, Clark, and Yogatama (2019); Merkx and Frank (2019); Talman, Yli-Jyrä, and Tiedemann (2019) study the extent to which a neural model can learn sentence representations in an end-to-end fashion. In particular, Maillard et al. (2019) let the model start from word embeddings to learn syntax and semantic structures jointly through a downstream task and without ever seeing gold standard parse trees.…”
Section: Selected Papers
confidence: 99%
“…The syntactic ambiguity of the learned representations is only briefly mentioned, but the proposed models clearly have the capacity of learning to model it. Merkx and Frank (2019) put the emphasis on not having a priori assumptions about lexical meaning: the model learns sentence representations by learning to retrieve the captions from images and vice versa. It is then evaluated on Semantic Textual Similarity tasks shown to correlate with human judgement quite well, but it does not reach high performance on the entailment task.…”
Section: Selected Papers
confidence: 99%