2019
DOI: 10.1017/s1351324919000196

Learning semantic sentence representations from visually grounded language without lexical knowledge

Abstract: Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep Neural Networks are trained to map the two modalities to a common embedding space such that for an image the correspondin…

Cited by 12 publications (27 citation statements)
References 44 publications

“…We take the cosine similarity cos(x, y) and subtract the similarity of the mismatched pairs from the matching pairs such that the loss is only zero when the matching pair is more similar than the mismatched pairs by a margin α. We use importance sampling to select the mismatched pairs; rather than using all the other samples in the mini-batch as mismatched pairs (as done in [8,15]), we calculate the loss using only the hardest examples (i.e. mismatched pairs with high cosine similarity).…”
Section: Training
confidence: 99%
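The quoted passage describes a margin-based hinge loss over cosine similarities with hard-negative mining inside the mini-batch. A minimal PyTorch sketch of that idea follows; the function and argument names (hinge_loss_hard_negatives, img_emb, cap_emb, margin) are illustrative assumptions, not the authors' published code.

```python
import torch
import torch.nn.functional as F

def hinge_loss_hard_negatives(img_emb, cap_emb, margin=0.2):
    """Sketch of a triplet hinge loss that uses only the hardest
    mismatched pair per anchor in the mini-batch (hypothetical names)."""
    # Normalise so that dot products equal cosine similarities.
    img_emb = F.normalize(img_emb, dim=1)
    cap_emb = F.normalize(cap_emb, dim=1)

    # sims[i, j] = cos(image_i, caption_j); the diagonal holds matching pairs.
    sims = img_emb @ cap_emb.t()
    pos = sims.diag().view(-1, 1)

    # Mask out the matching pairs before selecting hard negatives.
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    neg = sims.masked_fill(mask, float('-inf'))

    # Hardest mismatched caption per image, and hardest image per caption.
    hardest_cap = neg.max(dim=1).values.view(-1, 1)
    hardest_img = neg.max(dim=0).values.view(-1, 1)

    # Zero only when each matching pair beats its hardest mismatched
    # pair by at least the margin alpha.
    loss_i2c = F.relu(margin - pos + hardest_cap)
    loss_c2i = F.relu(margin - pos + hardest_img)
    return (loss_i2c + loss_c2i).mean()
```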
“…Another evaluation method for sentence-level semantics is to compare learned sentence similarities to human similarity judgments (e.g. Merkx and Frank, 2019).…”
Section: Related Work and Novelty
confidence: 99%
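The evaluation this passage refers to scores a model by correlating its sentence similarities with human similarity judgments, typically via a rank correlation such as Spearman's rho as in the STS benchmarks. A minimal sketch under that assumption, where the encode callable is a hypothetical stand-in for any sentence encoder:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(encode, sentence_pairs, human_scores):
    """encode: callable mapping a sentence string to a 1-D embedding.
    sentence_pairs: list of (s1, s2); human_scores: matching gold ratings."""
    model_scores = []
    for s1, s2 in sentence_pairs:
        a, b = encode(s1), encode(s2)
        # Cosine similarity between the two sentence embeddings.
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        model_scores.append(cos)
    # Spearman rank correlation against the human judgments.
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```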
“…The papers in this special issue investigate three main aspects that are under debate in the current research: (a) the capability of end-to-end trained neural networks (NNs) to learn sentence representations with no a priori assumption as to the existence or specifics of the interface between syntax and semantics in natural languages; (b) the need to combine NNs with formal structures defined a priori, following some theoretical assumptions on language syntax and its interplay with semantics; and (c) the importance of developing explainable models that are transparent in their findings and whose decisions are traceable. Maillard, Clark, and Yogatama (2019); Merkx and Frank (2019); Talman, Yli-Jyrä, and Tiedemann (2019) study the extent to which a neural model can learn sentence representations in an end-to-end fashion. In particular, Maillard et al. (2019) let the model start from word embeddings to learn syntax and semantic structures jointly through a downstream task and without ever seeing gold standard parse trees.…”
Section: Selected Papers
confidence: 99%
“…The syntactic ambiguity of the learned representations is only briefly mentioned, but the proposed models clearly have the capacity of learning to model it. Merkx and Frank (2019) put the emphasis on not having a priori assumptions about lexical meaning: the model learns sentence representations by learning to retrieve the captions from images and vice versa. It is then evaluated on Semantic Textual Similarity tasks shown to correlate with human judgement quite well, but it does not reach high performance on the entailment task.…”
Section: Selected Papers
confidence: 99%