We propose a definition of saliency by considering what the visual system is trying to optimize when directing attention. The resulting model is a Bayesian framework from which bottom-up saliency emerges naturally as the self-information of visual features, and overall saliency (incorporating top-down information with bottom-up saliency) emerges as the pointwise mutual information between the features and the target when searching for a target. An implementation of our framework demonstrates that our model's bottom-up saliency maps perform as well as or better than existing algorithms in predicting people's fixations in free viewing. Unlike existing saliency measures, which depend on the statistics of the particular image being viewed, our measure of saliency is derived from natural image statistics, obtained in advance from a collection of natural images. For this reason, we call our model SUN (Saliency Using Natural statistics). A measure of saliency based on natural image statistics, rather than based on a single test image, provides a straightforward explanation for many search asymmetries observed in humans; the statistics of a single test image lead to predictions that are not consistent with these asymmetries. In our model, saliency is computed locally, which is consistent with the neuroanatomy of the early visual system and results in an efficient algorithm with few free parameters.
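As a sketch of the quantities named above (in our own notation, not necessarily that of the paper), let F denote the local feature vector and T = 1 the event that the search target is present at a location. The bottom-up and overall saliency described in the abstract can then be written as

    s_{\mathrm{bu}}(f) = -\log p(F = f)

    s_{\mathrm{overall}}(f) = \log \frac{p(F = f,\, T = 1)}{p(F = f)\, p(T = 1)} = \log p(F = f \mid T = 1) - \log p(F = f)

where the first expression is the self-information of the features under natural image statistics and the second is the pointwise mutual information between features and target; when no target is specified, the likelihood term drops out and overall saliency reduces to the bottom-up term.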
When people try to find particular objects in natural scenes, they make extensive use of knowledge about how and where objects tend to appear in a scene. Although many forms of such "top-down" knowledge have been incorporated into saliency map models of visual search, surprisingly, the role of object appearance has rarely been investigated. Here we present an appearance-based saliency model derived in a Bayesian framework. We compare our approach with both bottom-up saliency algorithms and the state-of-the-art Contextual Guidance model of Torralba et al. (2006) at predicting human fixations. Although the two top-down approaches use very different types of information, they achieve similar performance; each is substantially better than the purely bottom-up models. Our experiments reveal that a simple model of object appearance can predict human fixations quite well, even making the same mistakes as people.
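The abstract does not spell out the computation, but one minimal way to realize an appearance-based Bayesian saliency map is as a log-likelihood ratio between a target-appearance model and a background model over local features. The Gaussian class models and function below are illustrative assumptions, not the authors' implementation.

    import numpy as np
    from scipy.stats import multivariate_normal

    def appearance_saliency(features, target_mean, target_cov, bg_mean, bg_cov):
        # features: (H, W, D) array of local descriptors (e.g. filter responses).
        # The target and background Gaussians are assumed to have been fit
        # offline on labeled patches of the sought object class and of generic scenes.
        flat = features.reshape(-1, features.shape[-1])
        log_p_target = multivariate_normal.logpdf(flat, mean=target_mean, cov=target_cov)
        log_p_bg = multivariate_normal.logpdf(flat, mean=bg_mean, cov=bg_cov)
        # High values: the local appearance is better explained by the target model
        # than by the generic background model.
        return (log_p_target - log_p_bg).reshape(features.shape[:2])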
We developed a rich dataset of Chest X-Ray (CXR) images to assist investigators in artificial intelligence. The data were collected using an eye-tracking system while a radiologist reviewed and reported on 1,083 CXR images. The dataset contains the following aligned data: the CXR image, the transcribed radiology report text, the radiologist's dictation audio, and the eye-gaze coordinate data. We hope this dataset can contribute to various areas of research, particularly explainable and multimodal deep learning/machine learning methods. Furthermore, investigators in disease classification and localization, automated radiology report generation, and human-machine interaction can benefit from these data. We report deep learning experiments that utilize attention maps produced from the eye-gaze data to show the potential utility of this dataset.
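As an illustration of how such gaze data can be turned into the attention maps mentioned above, the sketch below bins raw gaze coordinates into a fixation heatmap aligned with the image and smooths it. The (x, y) ordering, pixel units, and smoothing width are assumptions rather than the dataset's actual schema.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def gaze_to_attention_map(gaze_xy, image_shape, sigma_px=25):
        # gaze_xy: iterable of (x, y) pixel coordinates of gaze samples.
        # Returns a heatmap normalized to sum to 1.
        heat = np.zeros(image_shape, dtype=np.float64)
        for x, y in gaze_xy:
            xi, yi = int(round(x)), int(round(y))
            if 0 <= yi < image_shape[0] and 0 <= xi < image_shape[1]:
                heat[yi, xi] += 1.0
        heat = gaussian_filter(heat, sigma=sigma_px)
        total = heat.sum()
        return heat / total if total > 0 else heat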
The role of memory in guiding attention allocation in daily behaviors is not well understood. In experiments with two-dimensional (2D) images, there is mixed evidence about the importance of memory. Because the stimulus context in laboratory experiments differs extensively from that of daily behaviors, we investigated the role of memory in visual search in both 2D and three-dimensional (3D) environments. A 3D immersive virtual apartment composed of two rooms was created, and a parallel 2D visual search experiment using snapshots from the 3D environment was developed. Eye movements were tracked in both experiments. Repeated searches for geometric objects were performed to assess the role of spatial memory. Subsequently, subjects searched for realistic context objects to test for incidental learning. Our results show that subjects learned the room-target associations in 3D but less so in 2D. Gaze was increasingly restricted to relevant regions of the room with experience in both settings. Search for local contextual objects, however, was not facilitated by early experience. Incidental fixations to context objects do not necessarily benefit search performance. Together, these results demonstrate that memory for global aspects of the environment guides search by restricting allocation of attention to likely regions, whereas task relevance determines what is learned from the active search experience. Behaviors in 2D and 3D environments are comparable, although there is greater use of memory in 3D.
While it is universally acknowledged that both bottom-up and top-down factors contribute to the allocation of gaze, we currently have a limited understanding of how top-down factors determine gaze choices in the context of ongoing natural behavior. One purely top-down model, by Sprague, Ballard, and Robinson (2007), suggests that natural behaviors can be understood in terms of simple component behaviors, or modules, that are executed according to their reward value, with gaze targets chosen in order to reduce uncertainty about the particular world state needed to execute those behaviors. We explore the plausibility of the central claims of this approach in the context of a task where subjects walk through a virtual environment performing interception, avoidance, and path-following tasks. Many aspects of both walking-direction choices and gaze allocation are consistent with this approach. Subjects use gaze to reduce uncertainty about task-relevant information that is used to inform action choices. Notably, the addition of motion to peripheral objects did not affect fixations when the objects were irrelevant to the task, suggesting that stimulus saliency was not a major factor in gaze allocation. The modular approach of independent component behaviors is consistent with the main aspects of performance, but there were a number of deviations suggesting that modules interact. Thus, the model forms a useful, but incomplete, starting point for understanding top-down factors in active behavior.
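A toy paraphrase of the gaze-scheduling idea attributed to the modular account (our illustration, not the authors' implementation): each component behavior carries a reward value and an estimate of its current state uncertainty, and gaze goes to the behavior for which unresolved uncertainty is most costly.

    def choose_gaze_target(modules):
        # modules: list of dicts with keys 'name', 'reward', 'uncertainty'
        # (e.g. the variance of the module's estimate of its relevant world state).
        return max(modules, key=lambda m: m["reward"] * m["uncertainty"])["name"]

    # Example: an interception module whose state estimate has drifted wins gaze
    # over lower-stakes or better-resolved modules.
    modules = [
        {"name": "path_following", "reward": 1.0, "uncertainty": 0.2},
        {"name": "obstacle_avoidance", "reward": 2.0, "uncertainty": 0.1},
        {"name": "interception", "reward": 1.5, "uncertainty": 0.6},
    ]
    print(choose_gaze_target(modules))  # -> "interception"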
What is the role of the Fusiform Face Area (FFA)? Is it specific to face processing, or is it a visual expertise area? The expertise hypothesis is appealing due to a number of studies showing that the FFA is activated by pictures of objects within the subject's domain of expertise (e.g., cars for car experts, birds for birders, etc.), and that activation of the FFA increases as new expertise is acquired in the lab. However, it is incumbent upon the proponents of the expertise hypothesis to explain how it is that an area that is initially specialized for faces becomes recruited for new classes of stimuli. We dub this the "visual expertise mystery." One suggested answer to this mystery is that the FFA is used simply because it is a fine discrimination area, but this account has historically lacked a mechanism describing exactly how the FFA would be recruited for novel domains of expertise.
We have been developing techniques for extracting general world knowledge from miscellaneous texts by a process of approximate interpretation and abstraction, focusing initially on the Brown corpus. We apply interpretive rules to clausal patterns and patterns of modification, and concurrently abstract general "possibilistic" propositions from the resulting formulas. Two examples are "A person may believe a proposition" and "Children may live with relatives". Our methods currently yield over 117,000 such propositions (of variable quality) for the Brown corpus (more than 2 per sentence). We report here on our efforts to evaluate these results with a judging scheme aimed at determining how many of these propositions pass muster as "reasonable general claims" about the world in the opinion of human judges. We find that nearly 60% of the extracted propositions are favorably judged according to our scheme by any given judge. The percentage unanimously judged to be reasonable claims by multiple judges is lower, but still sufficiently high to suggest that our techniques may be of some use in tackling the long-standing "knowledge acquisition bottleneck" in AI.
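As a toy illustration of the abstraction step described above (ours, not the authors' actual extraction system), the snippet below maps a parsed clause to a generic "possibilistic" claim by replacing its arguments with coarse types drawn from a small, invented lexicon.

    # Invented mini-lexicon mapping specific words to coarse semantic types.
    TYPE_OF = {"John": "person", "claim": "proposition", "story": "proposition"}

    def abstract_clause(subject, verb, obj):
        # Replace the specific arguments with their types, keep the verb,
        # and hedge the result as a "may" (possibilistic) claim.
        subj_type = TYPE_OF.get(subject, subject)
        obj_type = TYPE_OF.get(obj, obj)
        return f"A {subj_type} may {verb} a {obj_type}"

    print(abstract_clause("John", "believe", "claim"))
    # -> "A person may believe a proposition"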