This paper presents a spatio-temporal saliency model that predicts eye movements during free viewing of videos. The model is inspired by the biology of the early stages of the human visual system. It extracts two signals from the video stream, corresponding to the two main outputs of the retina: parvocellular and magnocellular. Both signals are then split into elementary feature maps by cortical-like filters. These feature maps are used to form two saliency maps, a static one and a dynamic one, which are then fused into a spatio-temporal saliency map. The model is evaluated by comparing the salient areas of each frame, as predicted by the spatio-temporal saliency map, to the eye positions of different subjects recorded during a free-viewing experiment on a large video database (17,000 frames). In parallel, the static and dynamic pathways are analyzed to understand what is more or less salient and for which types of video our model is a good or poor predictor of eye movements.
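To make the fusion step concrete, here is a minimal Python sketch of how a static and a dynamic saliency map could be combined into a spatio-temporal map for one frame. The normalization and the weighted additive-plus-multiplicative fusion rule are illustrative assumptions, not the fusion scheme actually used in the paper.

```python
import numpy as np

def normalize(saliency_map):
    """Scale a saliency map to [0, 1]; return zeros if the map is flat."""
    lo, hi = saliency_map.min(), saliency_map.max()
    if hi - lo < 1e-12:
        return np.zeros_like(saliency_map)
    return (saliency_map - lo) / (hi - lo)

def fuse_saliency(static_map, dynamic_map, alpha=0.5):
    """Fuse a static and a dynamic saliency map into a spatio-temporal map.

    `alpha` weights the two pathways; the multiplicative term reinforces
    regions that are salient in both pathways at once (an assumption here,
    not necessarily the paper's fusion rule).
    """
    s = normalize(static_map)
    d = normalize(dynamic_map)
    return alpha * s + (1.0 - alpha) * d + s * d

# Example on placeholder maps for a single 72x96 frame
frame_static = np.random.rand(72, 96)
frame_dynamic = np.random.rand(72, 96)
spatio_temporal = fuse_saliency(frame_static, frame_dynamic)
```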
Conversation scenes are a typical example in which classical models of visual attention dramatically fail to predict eye positions. Indeed, these models rarely consider faces as particular gaze attractors and never take into account the important auditory information that always accompanies dynamic social scenes. We recorded the eye movements of participants viewing dynamic conversations taking place in various contexts. Conversations were seen either with their original soundtracks or with unrelated soundtracks (unrelated speech and abrupt or continuous natural sounds). First, we analyze how the auditory conditions influence the eye movement parameters of participants. Then, we model the probability distribution of eye positions across each video frame with a statistical method (Expectation-Maximization), allowing the relative contributions of different visual features to be quantified: static low-level visual saliency (based on luminance contrast), dynamic low-level visual saliency (based on motion amplitude), faces, and center bias. Through experimental and modeling results, we show that regardless of the auditory condition, participants look more at faces, and especially at talking faces. Hearing the original soundtrack makes participants follow the speech turn-taking more closely. However, we do not find any difference between the different types of unrelated soundtracks. These eye-tracking results are confirmed by our model, which shows that faces, and particularly talking faces, are the features that best explain the recorded gaze positions, especially in the original-soundtrack condition. Low-level saliency is not a relevant feature to explain eye positions in social scenes, even dynamic ones. Finally, we propose groundwork for an audiovisual saliency model.
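As an illustration of the statistical method mentioned above, the sketch below shows how Expectation-Maximization can estimate the relative weight of several feature maps (static saliency, dynamic saliency, faces, center bias) in explaining the eye positions recorded on one frame. The mixture-of-feature-maps formulation and all names are assumptions made for the example, not the authors' exact parameterization.

```python
import numpy as np

def em_feature_weights(fixations, feature_maps, n_iter=50):
    """Estimate mixture weights of feature maps for observed eye positions.

    fixations: int array of shape (n, 2) with (row, col) pixel coordinates.
    feature_maps: list of 2-D arrays (e.g. static saliency, dynamic saliency,
        face map, center bias); each is normalized here to sum to 1 so it can
        be read as a probability distribution over pixels.
    """
    # Likelihood of each fixation under each feature map
    probs = np.stack(
        [(fm / fm.sum())[fixations[:, 0], fixations[:, 1]] for fm in feature_maps],
        axis=1,
    )  # shape (n_fixations, n_features)
    weights = np.full(probs.shape[1], 1.0 / probs.shape[1])
    for _ in range(n_iter):
        # E-step: responsibility of each feature map for each fixation
        resp = weights * probs
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate the mixture weights
        weights = resp.mean(axis=0)
    return weights  # one weight per feature map, summing to 1
```

Comparing the estimated weights across auditory conditions then quantifies how much each feature contributes to gaze under each condition.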
Current models of visual perception suggest that during scene categorization, low spatial frequencies (LSF) are processed rapidly and activate plausible interpretations of the visual input. This coarse analysis would then be used to guide subsequent processing of high spatial frequencies (HSF). The present fMRI study examined how processing of LSF may influence that of HSF by investigating the neural bases of the semantic interference effect. We used hybrid scenes as stimuli, built by combining the LSF of one scene with the HSF of another, and participants had to categorize the HSF scene. Categorization was impaired when the LSF and HSF scenes were semantically dissimilar, suggesting that the LSF scene was processed automatically and interfered with categorization of the HSF scene. fMRI results revealed that this semantic interference effect was associated with increased activation in the inferior frontal gyrus, the superior parietal lobules, and the fusiform and parahippocampal gyri. Furthermore, a connectivity analysis (psychophysiological interaction) revealed that the semantic interference effect increased connectivity between the right fusiform and the right inferior frontal gyri. These results support influential models suggesting that, during scene categorization, LSF information is processed rapidly in the prefrontal cortex (pFC) and activates plausible interpretations of the scene category. These coarse predictions would then initiate top-down influences on recognition-related areas of the inferotemporal cortex, which could interfere with the categorization of the HSF information when it is semantically dissimilar to the LSF.
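For concreteness, the following Python sketch shows one common way of building a hybrid scene by combining the low spatial frequencies of one image with the high spatial frequencies of another, using Gaussian filtering. The cutoff (`sigma`, in pixels) is an arbitrary placeholder; the study defines its cutoffs in cycles per degree and its exact filtering procedure may differ.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_hybrid(scene_for_lsf, scene_for_hsf, sigma=8.0):
    """Combine the LSF of one grayscale scene with the HSF of another.

    scene_for_lsf, scene_for_hsf: 2-D arrays of identical shape.
    sigma: standard deviation (in pixels) of the Gaussian low-pass filter;
        a placeholder value, not the cutoff used in the study.
    """
    lsf = gaussian_filter(scene_for_lsf.astype(float), sigma)  # low-pass content
    hsf = scene_for_hsf.astype(float) - gaussian_filter(
        scene_for_hsf.astype(float), sigma
    )  # high-pass content (original minus its low-pass version)
    return lsf + hsf
```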
The P300 event-related potential has been extensively studied in electroencephalography with classical paradigms that require observers to keep their eyes still. This potential is classically used to infer whether a target or a task-relevant stimulus was presented. Few studies have investigated this potential with more ecological paradigms in which observers are free to move their eyes. In this study, we examined the P300 potential with an ecological paradigm and an adapted methodology, using a visual search task in which observers actively explored natural scenes with eye movements, while eye movements and electroencephalographic activity were co-registered. By averaging the electroencephalographic signal time-locked to fixation onsets, a P300 potential was observed for fixations on the target object, but not for other fixations recorded during the same visual search or for fixations recorded during free viewing without any task. Our approach consists of using control experimental conditions with similar eye movements to ensure that the P300 potential is attributable to the fact that the observer gazed at the target, rather than to other factors such as the eye movement pattern (the size of the previous saccade) or the "overlap issue" between the potentials elicited by two successive fixations. We also propose a model of the temporal overlap between the potentials elicited by consecutive fixations of varying durations. Our results show that the P300 potential can be studied in ecological situations without any constraint on the type of visual exploration, provided that some precautions are taken when interpreting the results because of the overlap issue.
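The key analysis step, averaging the EEG signal time-locked to fixation onsets, can be sketched as below. This is a generic fixation-related-potential average with a pre-fixation baseline; it deliberately does not correct for the overlap between potentials elicited by consecutive fixations, which is precisely the issue the study models and controls for. Function and parameter names are illustrative, not taken from the authors' pipeline.

```python
import numpy as np

def fixation_locked_average(eeg, fixation_onsets, sfreq, tmin=-0.2, tmax=0.6):
    """Average EEG epochs time-locked to fixation onsets.

    eeg: array of shape (n_channels, n_samples).
    fixation_onsets: fixation onset times in seconds.
    sfreq: sampling frequency in Hz.
    Returns the baseline-corrected average epoch (n_channels, n_epoch_samples).
    """
    pre = int(round(-tmin * sfreq))   # number of pre-fixation baseline samples
    post = int(round(tmax * sfreq))   # number of post-fixation samples
    epochs = []
    for t in fixation_onsets:
        onset = int(round(t * sfreq))
        if onset - pre < 0 or onset + post > eeg.shape[1]:
            continue  # skip fixations too close to the recording edges
        epoch = eeg[:, onset - pre:onset + post]
        # Baseline-correct on the pre-fixation interval
        epoch = epoch - epoch[:, :pre].mean(axis=1, keepdims=True)
        epochs.append(epoch)
    return np.mean(epochs, axis=0)
```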