Object recognition is an important human ability that relies on distinguishing between similar objects, for example, deciding which kitchen utensil(s) to use at different stages of meal preparation. Recent work describes the fine-grained organization of knowledge about manipulable objects by studying the constituent dimensions that are most relevant to human behavior, for example, vision-, manipulation-, and function-based object properties. A logical extension of this work is to ask whether these dimensions are uniquely human or can be approximated by deep learning. Here, we show that these behavioral dimensions are well predicted by CLIP-ViT, a state-of-the-art multimodal network trained on a large and diverse set of image-text pairs, and that, for the most part, its predictions generalize to previously unseen objects. Moreover, this model vastly outperforms comparison networks pre-trained on smaller, image-only datasets. These results demonstrate the impressive capacity of CLIP-ViT to approximate fine-grained human object knowledge. We discuss the possible sources of this advantage relative to the other models tested (e.g., multimodal image-text pre-training vs. image-only pre-training, dataset size, and architecture).
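
To make the prediction setup concrete, the sketch below illustrates one way such an analysis could be implemented: frozen CLIP-ViT image embeddings are used to predict a single behavioral dimension with a cross-validated ridge readout, so that each object's predicted score comes from a fit that never saw that object. The checkpoint name (openai/clip-vit-base-patch32), the choice of ridge regression, and the stand-in data are illustrative assumptions, not the exact pipeline reported here.

```python
# Illustrative sketch (not the paper's exact pipeline): predict one human
# behavioral dimension from frozen CLIP-ViT image embeddings with a
# cross-validated linear (ridge) readout. The checkpoint, the ridge
# regressor, and the random stand-in data below are assumptions.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from scipy.stats import pearsonr

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Stand-in data: replace with real object images and human dimension ratings.
rng = np.random.default_rng(0)
images = [Image.fromarray(rng.integers(0, 255, (224, 224, 3), dtype=np.uint8))
          for _ in range(60)]
dimension_scores = rng.random(60)  # e.g., ratings on a manipulation dimension

# Extract frozen image embeddings (no fine-tuning of CLIP-ViT).
with torch.no_grad():
    inputs = processor(images=images, return_tensors="pt")
    embeddings = model.get_image_features(**inputs).numpy()

# Cross-validated readout: each object's predicted score comes from a ridge
# model fitted on the remaining objects, approximating a test on unseen objects.
predicted = cross_val_predict(Ridge(alpha=1.0), embeddings, dimension_scores, cv=10)
r, _ = pearsonr(predicted, dimension_scores)
print(f"Cross-validated correlation with human ratings: r = {r:.2f}")
```

In this framing, the correlation between cross-validated predictions and human ratings quantifies how well the network's representations approximate the corresponding behavioral dimension.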