2022
DOI: 10.1038/s41467-022-28091-4

A self-supervised domain-general learning framework for human ventral stream representation

Abstract: Anterior regions of the ventral visual stream encode substantial information about object categories. Are top-down category-level forces critical for arriving at this representation, or can this representation be formed purely through domain-general learning of natural image structure? Here we present a fully self-supervised model which learns to represent individual images, rather than categories, such that views of the same image are embedded nearby in a low-dimensional feature space, distinctly from other r…
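The objective described in the abstract is an instance-level contrastive one: two augmented views of the same image are pulled together in a low-dimensional embedding space while all other images are pushed apart. Below is a minimal sketch of such an objective in PyTorch (an NT-Xent-style loss over a batch of paired views); the batch size, embedding dimension, and temperature are illustrative assumptions, not the paper's exact instance-prototype setup.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(z1, z2, temperature=0.1):
    # z1, z2: embeddings of two augmented views of the same batch of
    # images, shape (N, D); row i of z1 and row i of z2 are positives,
    # and every other image in the combined batch is a negative
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)          # (2N, D)
    sim = z @ z.T / temperature             # cosine similarities as logits
    sim.fill_diagonal_(float("-inf"))       # exclude self-similarity
    n = z1.size(0)
    # the positive for row i is its other view, offset by n (mod 2N)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

# toy usage: random "embeddings" standing in for encoder outputs
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(instance_contrastive_loss(z1, z2))
```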

Cited by 76 publications (73 citation statements). References 85 publications (58 reference statements).

Selected citation statements:
“…Although correlations in the face-selective regions were significantly higher than in the non-face-selective regions in both the peak intermediate layer and the final fully-connected layer, correlations with the peak intermediate layer were more than five times stronger than with the final fully-connected layer across face-selective regions. An additional analysis excluded the possibility that the low correlation was due to RSA’s inherent assumption of equal weights or scales for all features comprising the two RDMs (Conwell et al., 2021; Kaniuth and Hebart, 2021; Khaligh-Razavi et al., 2017; Konkle and Alvarez, 2022) (see Material and Methods for details and Figure S4 for more results).…”
Section: Results (mentioning)
confidence: 99%
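The "equal weights" concern in this excerpt refers to classic RSA, which compares two representational dissimilarity matrices (RDMs) without allowing model features to be rescaled. Below is a minimal sketch of both the classic and a feature-reweighted variant, assuming numpy arrays of shape (n_stimuli, n_features) and a non-negative least-squares fit; it illustrates the general idea rather than the exact procedures in the cited papers (which fit weights with cross-validation to avoid overfitting).

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.optimize import nnls

def rdm(X):
    # RDM: 1 - Pearson correlation between the response
    # patterns (rows) of every pair of stimuli
    return 1.0 - np.corrcoef(X)

def classic_rsa(model_acts, brain_acts):
    # rank-correlate the upper triangles of the two RDMs;
    # all model features implicitly contribute with equal weight
    iu = np.triu_indices(model_acts.shape[0], k=1)
    return spearmanr(rdm(model_acts)[iu], rdm(brain_acts)[iu]).correlation

def reweighted_rsa(model_acts, brain_acts):
    # fit one non-negative weight per model feature so the weighted
    # squared-difference RDM best matches the brain RDM
    iu = np.triu_indices(model_acts.shape[0], k=1)
    diffs = (model_acts[iu[0]] - model_acts[iu[1]]) ** 2  # (n_pairs, n_features)
    target = rdm(brain_acts)[iu]
    weights, _ = nnls(diffs, target)
    return spearmanr(diffs @ weights, target).correlation

# toy usage: 20 stimuli, 50 model features, 30 voxels
rng = np.random.default_rng(0)
model_acts = rng.standard_normal((20, 50))
brain_acts = rng.standard_normal((20, 30))
print(classic_rsa(model_acts, brain_acts), reweighted_rsa(model_acts, brain_acts))
```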
“…Because dynamic information plays a major role in the geometry of brain representations (Haxby et al., 2020b; Nastase et al., 2017; Russ and Leopold, 2015), static images could generate higher correlation values between brain responses and DCNNs that do not use motion information (Daube et al., 2021; Grossman et al., 2019; Tsantani et al., 2021). Similarly, studies that used stimuli spanning superordinate categories (e.g., with multiple visual categories (Konkle and Alvarez, 2022; Murty et al., 2021)) would bias representations towards categorical information, reducing the contribution of information that is needed for within-class individuation, such as face identification.…”
Section: Discussion (mentioning)
confidence: 99%
“…One common criticism of task-optimized models is that supervised training on classification tasks is inconsistent with biological learning (45). Recent advances in unsupervised learning have enabled useful representations to be learned from large quantities of natural data without explicit labels, potentially providing a more biologically plausible computational theory of learning (70-72). A priori it seemed plausible that the invariances of deep neural network models could be strongly dependent on supervised training for classification tasks, in which case models trained without supervision might be substantially more human-like according to the metamers test.…”
Section: Effects of Unsupervised Training (mentioning)
confidence: 99%
“…Alternatively, the use of more naturalistic “animal-view” movies as input (Betsch et al., 2004), coupled with architectural flexibility, may allow neural networks to learn topographic specializations predictive of patterns in the retina or other hierarchical layers of the visual system (Doshi and Konkle, 2021; Blauch et al., 2022). Recent work has shown some progress in this direction, with neural networks explicitly incorporating distinct objectives or constraints that result in the emergence of topographic organization and specializations (Plaut and Behrmann, 2011; Wang and Cottrell, 2017; Lee et al., 2020; Doshi and Konkle, 2021; Zhuang et al., 2021; Blauch et al., 2022; Konkle and Alvarez, 2022), shedding light on the origins of these organizational schemes.…”
Section: Implications for Future Studies of the Visual System (mentioning)
confidence: 99%
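As one concrete illustration of the "objectives or constraints" this excerpt refers to: topographic models typically assign each unit a fixed position on a simulated cortical sheet and add a spatial penalty to the task loss so that nearby units develop similar tuning. The sketch below shows one generic form of such a penalty (a distance-weighted response-similarity term, in PyTorch); the Gaussian neighborhood and the function name are illustrative assumptions, not the specific objective of any cited model.

```python
import torch

def topographic_penalty(responses, coords, sigma=1.0):
    # responses: (batch, n_units) unit activations for a batch of images
    # coords:    (n_units, 2) fixed positions on a simulated cortical sheet
    r = responses - responses.mean(dim=0, keepdim=True)
    r = r / (r.norm(dim=0, keepdim=True) + 1e-8)
    unit_corr = r.T @ r                     # (n_units, n_units) tuning similarity
    dist = torch.cdist(coords, coords)      # pairwise distances on the sheet
    neighborhood = torch.exp(-dist**2 / (2 * sigma**2))
    # nearby units (large neighborhood weight) are penalized for
    # having dissimilar responses (small correlation)
    return (neighborhood * (1.0 - unit_corr)).mean()

# toy usage: 64 units laid out on an 8x8 grid
coords = torch.stack(torch.meshgrid(torch.arange(8.0), torch.arange(8.0),
                                    indexing="ij"), dim=-1).reshape(-1, 2)
responses = torch.randn(32, 64)
print(topographic_penalty(responses, coords))
```

In training, a term like this would be added to the task or contrastive loss with a weighting coefficient, trading off task performance against topographic smoothness.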