Artificial intelligence (AI) promises to take the flawed intelligence of humans out of machines. Why, then, might we want to put the inchoate intelligence of human infants into machines? While infants seem to intuit others’ underlying intentions merely by observing their actions, AI systems, in contrast, fall short in such commonsense psychology. Here we put infant and machine intelligence into direct dialogue through their performance on the Baby Intuitions Benchmark (BIB), a comprehensive suite of tasks probing commonsense psychology. Following a preregistered design and analysis plan, we collected 288 individual responses of 11-month-old infants to BIB’s six tasks and tested three state-of-the-art learning-driven neural-network models from two different model classes. Infants’ performance revealed their comprehensive understanding of agents as rational and goal-directed, but the models failed to capture infants’ knowledge. Addressing these striking differences between human and artificial intelligence is critical to building machine common sense.
How we perceive the physical world is not only organized in terms of objects, but also structured in time as sequences of events. This is especially evident in intuitive physics, with temporally bounded dynamics such as falling, occlusion, and bouncing demarcating the continuous flow of sensory inputs. While the spatial structure and attentional consequences of physical objects have been well studied, much less is known about the temporal structure and attentional consequences of physical events in visual perception. Previous work has recognized physical events as units in the mind, and has used pre-segmented object interactions to explore physical representations. However, these studies did not address whether and how perception imposes the kind of temporal structure that carves out these physical events in the first place, nor the attentional consequences of such segmentation during intuitive physics. Here, we use performance-based tasks to address this gap. In Experiment 1, we find that perception not only spontaneously separates visual input in time into physical events, but also does so nonlinearly, within a few hundred milliseconds of the event boundary. In Experiment 2, we find that event representations, once formed, use coarse 'look ahead' simulations to selectively prioritize those objects that are predictively part of the unfolding dynamics. This rich temporal and predictive structure of physical events, formed during vision, should inform models of intuitive physics.