Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015
DOI: 10.3115/v1/n15-1016
Combining Language and Vision with a Multimodal Skip-gram Model

Abstract: We extend the SKIP-GRAM model of Mikolov et al. (2013a) by taking visual information into account. Like SKIP-GRAM, our multimodal models (MMSKIP-GRAM) build vector-based word representations by learning to predict linguistic contexts in text corpora. However, for a restricted set of words, the models are also exposed to visual representations of the objects they denote (extracted from natural images), and must predict linguistic and visual features jointly. The MMSKIP-GRAM models achieve good performance on a …
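To make the training objective concrete, here is a minimal PyTorch sketch (not the authors' code): standard skip-gram negative sampling, plus a max-margin term that maps word vectors into visual space for the subset of words that have image features, in the spirit of the paper's cross-modal variant. All names, dimensions, and the margin value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes: vocabulary, word dimension, visual feature dimension.
V, D, DV = 10_000, 300, 4096
emb_in  = torch.nn.Embedding(V, D)            # target-word embeddings
emb_out = torch.nn.Embedding(V, D)            # context-word embeddings
to_vis  = torch.nn.Linear(D, DV, bias=False)  # cross-modal map into visual space

def skipgram_loss(target, context, negatives):
    # target: (B,), context: (B,), negatives: (B, K) word indices.
    t = emb_in(target)                                   # (B, D)
    pos = (t * emb_out(context)).sum(-1)                 # (B,)
    neg = torch.bmm(emb_out(negatives),
                    t.unsqueeze(-1)).squeeze(-1)         # (B, K)
    return -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()

def visual_loss(target, img_pos, img_neg, margin=0.5):
    # img_pos / img_neg: (B, DV) visual features of the word's object
    # and of a randomly sampled distractor; only words with images
    # contribute this term.
    z = to_vis(emb_in(target))                # map word vectors to visual space
    pos = F.cosine_similarity(z, img_pos)     # (B,)
    neg = F.cosine_similarity(z, img_neg)
    return F.relu(margin - pos + neg).mean()  # max-margin ranking term
```

The total loss for a batch would then be the skip-gram term plus the visual term for the image-grounded subset; the weighting between the two is another free choice not fixed by the abstract.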

Cited by 231 publications (249 citation statements) · References 33 publications
“…On the other hand, the textual attributes fall short compared to the skip-gram embeddings (tAttrib vs. skip-gram, T). The bimodal SAE trained on the latter (skip-gram, vAttrib; T+V) is the overall best model, outperforming SVD and CCA (skip-gram, vAttrib, T+V), Lazaridou et al. [40], Bruni et al. [37], and all concatenation models. It yields a correlation coefficient of ρ = 0.77 on semantic similarity and ρ = 0.66 on visual similarity.…”
Section: Results
Citation type: mentioning (confidence: 99%)
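The ρ values quoted above follow the standard word-similarity protocol: rank-correlate model cosine similarities with human ratings. A minimal sketch, assuming embeddings stored in a word-to-vector dict; eval_similarity is a hypothetical helper, not code from the cited papers.

```python
import numpy as np
from scipy.stats import spearmanr

def eval_similarity(vecs, pairs, human_scores):
    """Spearman's rho between model cosine similarities and human ratings.

    vecs: dict mapping word -> np.ndarray embedding
    pairs: list of (word1, word2) tuples from the benchmark
    human_scores: gold similarity ratings, aligned with pairs
    """
    cos = [np.dot(vecs[a], vecs[b]) /
           (np.linalg.norm(vecs[a]) * np.linalg.norm(vecs[b]))
           for a, b in pairs]
    rho, _ = spearmanr(cos, human_scores)
    return rho
```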
“…We report results with our SAE model, SVD, and the CCA models using (automatically obtained) textual and visual attributes (tAttrib, vAttrib) or skip-gram embeddings and visual attributes (skip-gram, vAttrib). We also compare SAE to Lazaridou et al.'s [40] multimodal skip-gram model and Bruni et al. [37]. The third section of the table presents concatenation models using our textual and visual attributes (tAttrib, vAttrib), skip-gram embeddings, CNN features, and combinations thereof.…”
Section: Results
Citation type: mentioning (confidence: 99%)
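For reference, the concatenation baselines mentioned here fuse one vector per modality per word. A minimal sketch, assuming per-modality L2 normalisation before concatenation (the exact normalisation in the cited work may differ); concat_fusion is a hypothetical name.

```python
import numpy as np

def concat_fusion(text_vec, vis_vec):
    """Concatenate L2-normalised textual and visual vectors for one word."""
    t = text_vec / np.linalg.norm(text_vec)  # e.g. a skip-gram embedding
    v = vis_vec / np.linalg.norm(vis_vec)    # e.g. a CNN feature vector
    return np.concatenate([t, v])
```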