Human observers can track multiple objects among identical distractors based only on their spatiotemporal information. Since the first report of this ability in the seminal work of Pylyshyn and Storm (1988, Spatial Vision, 3, 179-197), multiple object tracking has attracted many researchers. One reason for this is the common argument that the attentional processes studied with the multiple object tracking paradigm closely resemble attentional processing during real-world tasks such as driving or team sports. We argue that multiple object tracking provides a good means to study the broader topic of continuous and dynamic visual attention. Indeed, several (partially contradictory) theories of attentive tracking have been proposed in the almost 30 years since its first report, and a large body of research has been conducted to test them. Given the richness and diversity of this literature, the aim of this tutorial review is to provide researchers who are new to the field with an overview of the multiple object tracking paradigm, its basic manipulations, and its links to other paradigms investigating visual attention and working memory. Further, we aim to review current theories of tracking as well as their empirical evidence. Finally, we review the state of the art in the most prominent research fields of multiple object tracking and how this research has helped to understand visual attention in dynamic settings.
Humans understand text and film by mentally representing their contents in situation models. These describe situations using dimensions like time, location, protagonist, and action. Changes in 1 or more dimensions (e.g., a new character enters the scene) cause discontinuities in the story line and are often perceived as boundaries between 2 meaningful units. Recent theoretical advances in event perception led to the assumption that situation models are represented in the form of event models in working memory. These event models are updated at event boundaries. Points in time at which event models are updated are important: Compared with situations during an ongoing event, situations at event boundaries are remembered more precisely and predictions about what happens next become less reliable. We hypothesized that these effects depend on the number of changes in the situation model. In 2 experiments, we had participants watch sitcom episodes and measured recognition memory and prediction performance for event boundaries that contained a change in 1, 2, 3, or 4 dimensions. Results showed a linear relationship: the more dimensions changed, the higher recognition performance was. At the same time, participants' predictions became less reliable with an increasing number of dimension changes. These results suggest that updating of event models at event boundaries occurs incrementally.
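The incremental-updating idea can be illustrated with a toy sketch. This is our own illustration only; the dictionary representation and the restriction to the four dimensions named above are assumptions, not part of the authors' model:

```python
# Toy sketch of incremental event-model updating. A "situation" is assumed
# to be a dict over the four dimensions named in the abstract.
DIMENSIONS = ("time", "location", "protagonist", "action")

def changed_dimensions(prev, curr):
    """List the dimensions whose values differ between two situations."""
    return [d for d in DIMENSIONS if prev.get(d) != curr.get(d)]

def update_event_model(model, curr):
    """Incremental updating: rewrite only the changed dimensions."""
    updated = dict(model)
    for d in changed_dimensions(model, curr):
        updated[d] = curr[d]
    return updated

before = {"time": "t1", "location": "kitchen", "protagonist": "A", "action": "cook"}
after_ = {"time": "t1", "location": "garden", "protagonist": "A", "action": "dig"}
print(changed_dimensions(before, after_))  # ['location', 'action']
```

On this caricature, the cost of an update (and the strength of the boundary) scales with the number of changed dimensions, which is the linear relationship the experiments tested.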
Observers can visually track multiple independently moving objects even if the scene containing them is rotated smoothly. Abrupt scene rotations make tracking more difficult but not impossible. For nonrotated, stable dynamic displays, the strategy of looking at the targets' centroid has been shown to be important for visual tracking. But which factors determine successful visual tracking in a nonstable dynamic display? We report two eye tracking experiments that provide evidence for centroid looking. Across abrupt viewpoint changes, gaze on the centroid is more stable than gaze on targets, indicating a process of realigning targets as a group. Further, we show that the relative importance of centroid looking increases with object speed.
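As a minimal illustration of why centroid looking supports realignment (a sketch under our own assumptions, not the authors' analysis code): the centroid is simply the mean target position, so a viewpoint change that displaces every target displaces the centroid by one common vector, making it a stable anchor for regrouping the targets.

```python
def centroid(targets):
    """Mean (x, y) position of the tracked targets."""
    xs, ys = zip(*targets)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

targets = [(100.0, 200.0), (300.0, 180.0), (200.0, 400.0)]
print(centroid(targets))  # (200.0, 260.0)
```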
We examined whether surface feature information is utilized to track the locations of multiple objects. In particular, we tested whether surface features and spatiotemporal information are weighted according to their availability and reliability. Accordingly, we hypothesized that surface features should affect location tracking across spatiotemporal discontinuities. Three kinds of spatiotemporal discontinuities were implemented across five experiments: abrupt scene rotations, abrupt zooms, and a reduced presentation frame rate. Objects were briefly colored across the spatiotemporal discontinuity. Distinct coloring that matched spatiotemporal information across the discontinuity improved tracking performance as compared with homogeneous coloring. Swapping distinct colors across the discontinuity impaired performance. Correspondence by color was further demonstrated by more mis-selected distractors appearing in a former target color than distractors appearing in a former distractor color in the swap condition. This was true even when color never supported tracking and when participants were instructed to ignore color. Furthermore, effects of object color on tracking occurred with unreliable spatiotemporal information but not with reliable spatiotemporal information. Our results demonstrate that surface feature information can be utilized to track the locations of multiple objects. This is in contrast to theories stating that objects are tracked based on spatiotemporal information only. We introduce a flexible-weighting tracking account stating that spatiotemporal information and surface features are both utilized by the location tracking mechanism. The two sources of information are weighted according to their availability and reliability. Surface feature effects on tracking are particularly likely when distinct surface feature information is available and spatiotemporal information is unreliable.
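The flexible-weighting account can be caricatured as reliability-weighted cue combination. The following sketch is our own illustration; the exponential similarity function and the numeric reliability weights are assumptions, not part of the authors' model:

```python
import math

def match_score(pos_dist, color_dist, st_reliability, sf_reliability):
    """Evidence that a candidate object is the continuation of a target,
    combining spatiotemporal proximity and surface-feature (color)
    similarity, each weighted by its reliability in [0, 1]."""
    st_sim = math.exp(-pos_dist)    # closer in space-time -> more similar
    sf_sim = math.exp(-color_dist)  # closer in color -> more similar
    return (st_reliability * st_sim + sf_reliability * sf_sim) / (
        st_reliability + sf_reliability)

# When spatiotemporal information is unreliable (weight near 0),
# correspondence is driven almost entirely by color -- which would
# produce the color-swap mis-selections reported above:
print(match_score(pos_dist=5.0, color_dist=0.0,
                  st_reliability=0.1, sf_reliability=0.9))
```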
Human long-term memory for visual objects and scenes is tremendous. Here, we test how auditory information contributes to long-term memory performance for realistic scenes. In a total of six experiments, we manipulated the presentation modality (auditory, visual, audio-visual) as well as the semantic congruency and temporal synchrony between the auditory and visual information of brief filmic clips. Our results show that audio-visual clips generally elicit more accurate memory performance than unimodal clips. This advantage increases further with congruent visual and auditory information. However, violations of audio-visual synchrony have hardly any influence on memory performance. Memory performance remained intact even with a sequential presentation of auditory and visual information, but declined when the matching tracks of one scene were presented separately with intervening tracks during learning. With respect to memory performance, our results therefore show that audio-visual integration is sensitive to semantic congruency but remarkably robust against asynchronies between the different modalities.
People can keep track of target objects as they move among identical distractors using only spatiotemporal information. We investigated whether participants use motion information during the moment-to-moment tracking of objects by adding motion to the texture of the moving objects. The texture either remained static or moved relative to each object's direction of motion: in the same direction, in the opposite direction, or orthogonally to the object's trajectory. Compared with the static texture condition, tracking performance was worse when the texture moved in the direction opposite to the object's motion and better when it moved in the same direction. Our results support the conclusion that motion information is used during the moment-to-moment tracking of objects. Motion information may either affect a representation of position or be used to periodically predict the future locations of targets.
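The prediction idea in the last sentence can be sketched as simple linear extrapolation. This is our own illustration, not the authors' model; the bias term is a hypothetical way texture motion could nudge the tracker's velocity estimate toward (congruent texture) or away from (incongruent texture) the true trajectory:

```python
def extrapolate(pos, vel, dt):
    """Predict an object's future location from its current position and
    an estimated velocity (linear extrapolation)."""
    return (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)

# Hypothetical bias: texture motion contaminates the velocity estimate.
def biased_velocity(obj_vel, texture_vel, bias=0.2):
    return (obj_vel[0] + bias * texture_vel[0],
            obj_vel[1] + bias * texture_vel[1])

print(extrapolate((0.0, 0.0), (10.0, 5.0), 0.5))  # (5.0, 2.5)
```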
The present study combined the approaches of multimedia learning and comparative visual search (Hardiess, Gillner, & Mallot, 2008) in order to analyse the processing of spatially separated information. Participants were asked to compare two depictions of a mechanical pendulum clock to detect no, one, or two differences between them. The spatial distance between the two depictions was varied, and participants received either stimulus-related information about the functionality of pendulum clocks or stimulus-unrelated information about the design of cuckoo clocks. The study demonstrates a trade-off between gaze movement and working memory use: we observed fewer gaze shifts with increasing distance between the pictures, suggesting higher working memory use. The findings indicate that the distance between the two pictures, domain knowledge, and visual working memory span are important factors determining the memory load required for processing split information sources.