Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015
DOI: 10.3115/v1/n15-1016
Combining Language and Vision with a Multimodal Skip-gram Model

Abstract: We extend the SKIP-GRAM model of Mikolov et al. (2013a) by taking visual information into account. Like SKIP-GRAM, our multimodal models (MMSKIP-GRAM) build vector-based word representations by learning to predict linguistic contexts in text corpora. However, for a restricted set of words, the models are also exposed to visual representations of the objects they denote (extracted from natural images), and must predict linguistic and visual features jointly. The MMSKIP-GRAM models achieve good performance on a …
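To make the training objective concrete, here is a minimal PyTorch sketch (not the authors' code): standard skip-gram negative sampling, plus a max-margin term that maps word vectors into visual space for the subset of words that have image features, in the spirit of the paper's cross-modal variant. All names, dimensions, and the margin value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes: vocabulary, word dimension, visual feature dimension.
V, D, DV = 10_000, 300, 4096
emb_in  = torch.nn.Embedding(V, D)            # target-word embeddings
emb_out = torch.nn.Embedding(V, D)            # context-word embeddings
to_vis  = torch.nn.Linear(D, DV, bias=False)  # cross-modal map into visual space

def skipgram_loss(target, context, negatives):
    # target: (B,), context: (B,), negatives: (B, K) word indices.
    t = emb_in(target)                                   # (B, D)
    pos = (t * emb_out(context)).sum(-1)                 # (B,)
    neg = torch.bmm(emb_out(negatives),
                    t.unsqueeze(-1)).squeeze(-1)         # (B, K)
    return -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()

def visual_loss(target, img_pos, img_neg, margin=0.5):
    # img_pos / img_neg: (B, DV) visual features of the word's object
    # and of a randomly sampled distractor; only words with images
    # contribute this term.
    z = to_vis(emb_in(target))                # map word vectors to visual space
    pos = F.cosine_similarity(z, img_pos)     # (B,)
    neg = F.cosine_similarity(z, img_neg)
    return F.relu(margin - pos + neg).mean()  # max-margin ranking term
```

The total loss for a batch would then be the skip-gram term plus the visual term for the image-grounded subset; the weighting between the two is another free choice not fixed by the abstract.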

Cited by 231 publications (249 citation statements) · References 33 publications
“…On the other hand, the textual attributes fall short compared to the skip-gram embeddings (tAttrib vs. skip-gram, T). The bimodal SAE trained on the latter (skip-gram, vAttrib; T+V) is the overall best model, outperforming SVD and CCA (skip-gram, vAttrib, T+V), Lazaridou et al. [40], Bruni et al. [37], and all concatenation models. It yields a correlation coefficient of ρ = 0.77 on semantic similarity and ρ = 0.66 on visual similarity.…”
Section: Results
Citation type: mentioning (confidence: 99%)
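The ρ values quoted above follow the standard word-similarity protocol: rank-correlate model cosine similarities with human ratings. A minimal sketch, assuming embeddings stored in a word-to-vector dict; eval_similarity is a hypothetical helper, not code from the cited papers.

```python
import numpy as np
from scipy.stats import spearmanr

def eval_similarity(vecs, pairs, human_scores):
    """Spearman's rho between model cosine similarities and human ratings.

    vecs: dict mapping word -> np.ndarray embedding
    pairs: list of (word1, word2) tuples from the benchmark
    human_scores: gold similarity ratings, aligned with pairs
    """
    cos = [np.dot(vecs[a], vecs[b]) /
           (np.linalg.norm(vecs[a]) * np.linalg.norm(vecs[b]))
           for a, b in pairs]
    rho, _ = spearmanr(cos, human_scores)
    return rho
```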
“…We report results with our SAE model, SVD, and the CCA models using (automatically obtained) textual and visual attributes (tAttrib, vAttrib) or skip-gram embeddings and visual attributes (skip-gram, vAttrib). We also compare SAE to Lazaridou et al.'s [40] multimodal skip-gram model and Bruni et al. [37]. The third section of the table presents concatenation models using our textual and visual attributes (tAttrib, vAttrib), skip-gram embeddings, CNN features, and combinations thereof.…”
Section: Results
Citation type: mentioning (confidence: 99%)
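For reference, the concatenation baselines mentioned here fuse one vector per modality per word. A minimal sketch, assuming per-modality L2 normalisation before concatenation (the exact normalisation in the cited work may differ); concat_fusion is a hypothetical name.

```python
import numpy as np

def concat_fusion(text_vec, vis_vec):
    """Concatenate L2-normalised textual and visual vectors for one word."""
    t = text_vec / np.linalg.norm(text_vec)  # e.g. a skip-gram embedding
    v = vis_vec / np.linalg.norm(vis_vec)    # e.g. a CNN feature vector
    return np.concatenate([t, v])
```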