2018
DOI: 10.1007/s11263-018-1140-0
Semantic Understanding of Scenes Through the ADE20K Dataset

Abstract: Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite efforts of the community in data collection, there are still few image datasets covering a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present a densely annotated dataset ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. Totally there are 25k images of the complex everyday scenes cont…


Cited by 1,094 publications (837 citation statements)
References 35 publications
“…To accomplish this, we needed a large dataset of natural images in which all object occurrences were labeled. We took advantage of the recently created ADE20K database, which contains 22,210 annotated scenes in which every object has been manually labeled by an expert human annotator [25]. One approach for characterizing the co-occurrence statistics of this dataset would be to simply construct a matrix of co-occurrence frequencies for all pairwise comparisons of objects.…”
Section: Object Embeddingsmentioning
confidence: 99%
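The pairwise co-occurrence matrix this passage describes can be built directly from per-image label lists. A minimal sketch (the toy scene lists and function name are hypothetical illustrations, not the ADE20K annotation format):

```python
# Hypothetical sketch: pairwise object co-occurrence counts from
# per-image object-label lists. Scene data below is illustrative only.
from collections import Counter
from itertools import combinations

def cooccurrence_counts(scenes):
    """scenes: iterable of lists of object labels, one list per image."""
    counts = Counter()
    for labels in scenes:
        # Count each unordered pair of distinct labels once per image.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts

scenes = [
    ["wall", "bed", "lamp"],
    ["wall", "bed", "window"],
    ["wall", "sofa", "lamp"],
]
counts = cooccurrence_counts(scenes)
# ("bed", "wall") co-occurs in 2 of the 3 toy scenes
```

Deduplicating labels per image (`set`) counts presence rather than instance multiplicity; whether to count instances is a design choice the passage leaves open.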
“…In the field of computational linguistics, there is a long history of modeling word co-occurrence data in language corpora with dense, lower-dimensional representations [26]. This modeling framework, known as distributional semantics, has proved highly useful. We applied it to ADE20K [25], which contains 22,210 images in which every pixel is associated with an object label provided by an expert human annotator. An adaptation of the word2vec machine-learning algorithm for distributional semantics, which we call object2vec, was applied to this corpus of image annotations to model the statistical regularities of object-label co-occurrence in a large sample of real-world scenes.…”
Section: Object Embeddingsmentioning
confidence: 99%
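The distributional-semantics idea in this passage can be sketched with a count-based stand-in for word2vec: build the object co-occurrence matrix and factor it with a truncated SVD, so objects that appear in similar scenes receive similar vectors. The toy scenes, the SVD substitution, and all names below are assumptions for illustration, not the authors' object2vec pipeline:

```python
# Hedged sketch of the distributional-semantics idea behind "object2vec".
# word2vec is replaced here by a simpler count-based model (truncated SVD
# of a co-occurrence matrix); toy data and names are illustrative only.
import numpy as np

scenes = [
    ["wall", "bed", "lamp"],
    ["wall", "bed", "pillow"],
    ["bed", "lamp"],
    ["road", "car", "tree"],
    ["road", "car", "building"],
]

labels = sorted({l for s in scenes for l in s})
index = {l: i for i, l in enumerate(labels)}

# Symmetric object-by-object co-occurrence matrix.
C = np.zeros((len(labels), len(labels)))
for s in scenes:
    for a in s:
        for b in s:
            if a != b:
                C[index[a], index[b]] += 1

# Low-dimensional object embeddings via truncated SVD.
U, S, _ = np.linalg.svd(C)
k = 2
emb = U[:, :k] * S[:k]

def similarity(a, b):
    """Cosine similarity between two object embeddings."""
    va, vb = emb[index[a]], emb[index[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9))

# Objects from the same scene type ("bed"/"lamp") come out more similar
# than objects from different scene types ("bed"/"car").
```

A neural skip-gram model, as word2vec uses, would learn embeddings from the same label-list "sentences"; the SVD variant above just makes the same intuition runnable with no training loop.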
“…Therefore, in this analysis, we investigate whether units corresponding to the free space show a higher correlation with the behavior and brain RDMs than the readout layer of the VGG scene-parse network. The readout layer of the VGG scene-parse network consists of 151 channels, with 150 channels each containing an output corresponding to a particular class in the ADE20k [26] dataset and 1 channel corresponding to the background. Therefore, it is straightforward to separate specific category activation from the readout layer.…”
mentioning
confidence: 99%
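Separating a single class's activation from such a readout layer is a matter of channel indexing. A minimal sketch, assuming a (channels, H, W) layout and a background-last channel ordering (both assumptions; the passage does not specify them):

```python
# Sketch of separating per-class activations from a 151-channel readout
# layer (150 ADE20K classes + 1 background), as the passage describes.
# The (channels, H, W) layout and background-last order are assumptions.
import numpy as np

n_classes = 150  # one channel per ADE20K class; channel 150 = background
readout = np.random.rand(n_classes + 1, 8, 8)  # dummy activation maps

def class_activation(readout, class_idx):
    """Return the spatial activation map for a single class channel."""
    if not 0 <= class_idx < n_classes:
        raise IndexError("class channel out of range")
    return readout[class_idx]

# Per-pixel argmax over all 151 channels yields a segmentation map.
pred = readout.argmax(axis=0)
```

Because each class owns exactly one channel, no decoding step is needed: slicing the channel axis is the whole "separation" the passage calls straightforward.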