An important step in understanding visual scenes is their organization into distinct perceptual objects, which requires figure-ground segregation. The determination of which side of an occlusion boundary is figure (closer to the observer) and which is ground (farther from the observer) is made through a combination of global cues, such as convexity, and local cues, such as T-junctions. Here we focus on a novel set of local cues in the intensity patterns along occlusion boundaries, which we show differ between figure and ground. Image patches are extracted along object boundaries in natural scenes from two standard image sets, and spectral analysis is performed separately on the figure and ground sides. On the figure side, oriented spectral power orthogonal to the occlusion boundary significantly exceeds that parallel to the boundary. This “spectral anisotropy” is present only at higher spatial frequencies and is absent on the ground side. The difference in spectral anisotropy between the two sides of an occlusion border predicts which side is figure and which is ground with an accuracy exceeding 60% per patch. Spectral anisotropy at nearby locations along the boundary co-varies but is largely independent over larger distances, which allows results from different image regions to be combined. Given the low cost of this strictly local computation, we propose that spectral anisotropy along occlusion boundaries is a valuable cue for figure-ground segregation. A database of images and extracted patches labeled for figure and ground is made freely available.
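A minimal sketch of how such a spectral anisotropy measure could be computed for one patch, assuming the patch has already been rotated so the occlusion boundary runs vertically; the frequency cutoff, angular wedges, and function name are illustrative assumptions, not the parameters used in the paper.

```python
import numpy as np

def spectral_anisotropy(patch, low_freq_cutoff=0.25):
    """Oriented power orthogonal minus parallel to the boundary, for one patch.

    patch           : 2-D array with the boundary aligned to the vertical axis
    low_freq_cutoff : fraction of the Nyquist frequency below which power is
                      ignored (the cue is reported only at higher spatial frequencies)
    """
    # Window the mean-subtracted patch to reduce spectral leakage from its borders.
    win = np.outer(np.hanning(patch.shape[0]), np.hanning(patch.shape[1]))
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2((patch - patch.mean()) * win))) ** 2

    h, w = spectrum.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]   # frequency parallel to the boundary
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]   # frequency orthogonal to the boundary
    radius = np.sqrt(fx ** 2 + fy ** 2)
    angle = np.arctan2(np.abs(fy), np.abs(fx))         # 0 = orthogonal axis, pi/2 = parallel axis

    high = radius > low_freq_cutoff * 0.5              # keep only higher spatial frequencies
    orth = spectrum[high & (angle < np.pi / 8)].sum()      # wedge around the orthogonal axis
    par = spectrum[high & (angle > 3 * np.pi / 8)].sum()   # wedge around the parallel axis
    return (orth - par) / (orth + par + 1e-12)
```

Under this reading, the measure would be evaluated on corresponding patches from the two sides of a boundary, and the side with the larger anisotropy value would be predicted to be the figure.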
In this paper we provide an overview of audio-visual saliency map models. In the simplest model, the location of the auditory source is modeled as a Gaussian, and different methods are used to combine the auditory and visual information. We then provide experimental results with applications of simple audio-visual integration models for cognitive scene analysis. We validate the simple audio-visual saliency models with a hardware convolutional network architecture and real data recorded from moving audio-visual objects. The latter system was developed in the Torch language by extending the attention.lua (code) and attention.ui (GUI) files that implement Culurciello's visual attention model.

I. INTRODUCTION

Scientists and engineers have traditionally separated the analysis of a multisensory scene into its constituent sensory domains. In this approach, for example, all auditory events are processed separately and independently of visual and somatosensory streams, even though the same multisensory event may give rise to those constituent streams. It was previously necessary to compartmentalize the analysis because of the sheer enormity of information as well as the limitations of experimental techniques and computational resources. With recent advances in science and technology, it is now possible to perform integrated analysis of sensory systems, including interactions within and across sensory modalities. Such efforts are becoming increasingly common in cellular neurophysiology, imaging, and psychophysics studies [1], [2]. A better understanding of interaction, information integration, and complementarity of information across senses may help us build more intelligent algorithms for object detection, object recognition, human activity and gait detection, surveillance, tracking, biometrics, etc., with better performance, stability, and robustness to noise. For example, fusing auditory (voice) and visual (face) features can help improve the performance of speaker identification and face recognition systems [3], [4]. There are several examples of highly successful neuromorphic engineering systems [5], [6] that mimic the function of individual sensory systems. However, these efforts have so far been limited to modeling individual sensory systems rather than the interactions between them. Our goal in this work is to build computational models of multisensory processing to analyze real-world perceptual scenes. We limit our focus to two important sensory systems: the visual and auditory systems. Our work is divided into two parts, one being computational verification and the other hardware implementation. We investigate the nature of multisensory interaction between the auditory and visual domains. More specifically, we consider the effect of a spatially co-occurring auditory stimulus on the salience of an inconspicuous visual target at the same spatial location among other visual distractors. Temporal concurrency is assumed between the visual and auditory events. The motivation for this work is that audio-visual integration is hi...
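A minimal sketch of the simple integration scheme described above, assuming the auditory source location is rendered as a spatial Gaussian and fused with a visual saliency map; the fusion rules (additive, multiplicative), the sigma value, and the function names are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def auditory_map(shape, source_xy, sigma=20.0):
    """Gaussian 'auditory saliency' map centered on the estimated source location."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    x0, y0 = source_xy
    g = np.exp(-((xx - x0) ** 2 + (yy - y0) ** 2) / (2.0 * sigma ** 2))
    return g / g.max()

def fuse(visual_sal, auditory_sal, mode="additive", alpha=0.5):
    """Combine normalized visual and auditory saliency maps."""
    v = visual_sal / (visual_sal.max() + 1e-12)
    if mode == "additive":          # linear weighting of the two modalities
        return alpha * v + (1 - alpha) * auditory_sal
    if mode == "multiplicative":    # the sound gates visual salience at its location
        return v * auditory_sal
    raise ValueError(f"unknown fusion mode: {mode}")

# Usage: boost an inconspicuous visual target that spatially co-occurs with a sound.
visual = np.random.rand(120, 160) * 0.3               # stand-in for a visual saliency map
audio = auditory_map(visual.shape, source_xy=(80, 60))  # sound source at (x=80, y=60)
combined = fuse(visual, audio, mode="multiplicative")
```

In this sketch, multiplicative fusion suppresses visual locations far from the sound source, while additive fusion merely raises the salience near it; which behavior better matches the experiments is left to the models compared in the paper.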