Understanding Image Representations by Measuring Their Equivariance and Equivalence

Lenc, Karel; Vedaldi, Andrea

doi:10.1007/s11263-018-1098-y

Cited by 58 publications

(20 citation statements)

References 50 publications


“…Further, “efference copy” signals (Colby et al, 1992; Crapse and Sommer 2007), which signal the magnitude and direction of movements between samples, might also lead to predictable shifts in the embedding space. This intrinsic information about the sampling process could enable the system to learn representations that are “equivariant”, as opposed to “invariant”, over identity-preserving transformations (c.f., Lenc and Vedaldi, 2015; Bouchacourt et al, 2021).…”

Section: Discussion (mentioning)

confidence: 99%

A self-supervised domain-general learning framework for human ventral stream representation

Konkle

Alvarez

2020

Preprint


Humans learn object categories without millions of labels, but to date the models with the highest correspondence to primate visual systems are all category-supervised. This paper introduces a new self-supervised learning framework: instance-prototype contrastive learning (IPCL), and compares the internal representations learned by this model and other instance-level contrastive learning systems to the structure of human brain responses. We present the first evidence to date showing that self-supervised systems can show more brain-like representation than category-supervised models. Further, we find that recent substantial gains in top-1 accuracy from instance-wise contrastive learning models do not result in more brain-like representation; instead we find the architecture and normalization scheme are critical. Finally, this dataset reveals substantial representational structure in intermediate and late stages of the human visual system that is not accounted for by any model, whether self-supervised or category-supervised. Considering both neuroscience and machine vision perspectives, these results provide promise for instance-level representation as a key objective of visual system encoding, and highlight the room to grow towards more robust, efficient, human-like object representation.


“…To handle such ambiguity, the image generation in RSL-Net [39] is conditional on both a satellite image and a live radar image, where the exact appearance of the synthetic image is dictated by the live radar image, and the synthetic image is pixel-wise aligned to the satellite image. In particular, as CNNs are non-equivariant to rotation [26], RSL-Net seeks to infer the rotation offset prior to image generation:…”

Section: B Image Generation Vs Point Learning (mentioning)

confidence: 99%

Get to the Point: Learning Lidar Place Recognition and Metric Localisation Using Overhead Imagery

Tang

Martini

Newman

2021

Robotics: Science and Systems XVII


This paper is about localising a robot in overhead images using lidar. Specifically, we show how to solve both place recognition and metric localisation of a lidar using only publicly available overhead imagery as a map proxy. This is in contrast to current approaches that rely on prior sensor maps. To handle the drastic modality difference (overhead image vs. on the ground lidar), our method learns a representation that purposely and suitably transforms a given overhead image into a collection of 2D points, allowing for direct comparison against lidar scans. After both modalities are expressed as points, point-based methods can then be leveraged to learn the registration and place recognition task. Our method is the first to learn the place recognition of a lidar using only overhead imagery, and outperforms prior work for metric localisation with large initial pose offsets.


“…In an attempt to better understand the properties of a CNN, some recent vision works have focused on analyzing their internal representations (Szegedy et al 2014; Yosinski et al 2014; Lenc and Vedaldi 2015; Mahendran and Vedaldi 2015; Zeiler and Fergus 2014; Simonyan et al 2014; Agrawal et al 2014; Zhou et al 2015; Eigen et al 2013). Some of these investigated properties of the network, like stability (Szegedy et al 2014), feature transferability (Yosinski et al 2014), equivariance, invariance and equivalence (Lenc and Vedaldi 2015), the ability to reconstruct the input (Mahendran and Vedaldi 2015) and how the number of layers, filters and parameters affects the network performance (Agrawal et al 2014; Eigen et al 2013). Zeiler and Fergus (2014) use deconvolutional networks to visualize locally optimal visual inputs for individual filters.…”

Section: Related Work (mentioning)

confidence: 99%

Do Semantic Parts Emerge in Convolutional Neural Networks?

2017


Semantic object parts can be useful for several visual recognition tasks. Lately, these tasks have been addressed using Convolutional Neural Networks (CNN), achieving outstanding results. In this work we study whether CNNs learn semantic parts in their internal representation. We investigate the responses of convolutional filters and try to associate their stimuli with semantic parts. We perform two extensive quantitative analyses. First, we use groundtruth part bounding-boxes from the PASCAL-Part dataset to determine how many of those semantic parts emerge in the CNN. We explore this emergence for different layers, network depths, and supervision levels. Second, we collect human judgements in order to study what fraction of all filters systematically fire on any semantic part, even if not annotated in PASCAL-Part. Moreover, we explore several connections between discriminative power and semantics. We find out which are the most discriminative filters for object recognition, and analyze whether they respond to semantic parts or to other image patches. We also investigate the other direction: we determine which semantic parts are the most discriminative and whether they correspond to those parts emerging in the network. This enables to gain an even deeper understanding of the role of semantic parts in the network.
