We present a meta-learning framework for learning new visual concepts quickly, from just one or a few examples, guided by multiple naturally occurring data streams: simultaneously looking at images, reading sentences that describe the objects in the scene, and interpreting supplemental sentences that relate the novel concept to other concepts. The learned concepts support downstream applications, such as answering questions by reasoning about unseen images. Our model, FALCON, represents individual visual concepts, such as colors and shapes, as axis-aligned boxes in a high-dimensional space (the "box embedding space"). Given an input image and its paired sentence, our model first resolves the referential expression in the sentence and associates the novel concept with particular objects in the scene. Next, our model interprets supplemental sentences to relate the novel concept to other known concepts, such as "X has property Y" or "X is a kind of Y". Finally, it infers an optimal box embedding for the novel concept that jointly (1) maximizes the likelihood of the observed instances in the image, and (2) satisfies the relationships between the novel concept and the known ones. We demonstrate the effectiveness of our model on both synthetic and real-world datasets.
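The box-embedding idea in the abstract can be sketched in a few lines. This is an illustrative toy, not FALCON's published model: the `Box` class, the `containment` score, and the 2-D coordinates for "cube" and "red cube" are all assumptions introduced here to show how axis-aligned boxes can encode "X is a kind of Y" as (soft) box containment.

```python
# Toy sketch (not FALCON's actual implementation): concepts as
# axis-aligned boxes; "A is a kind of B" ~ A's box lies inside B's box.
import numpy as np

class Box:
    """An axis-aligned box in D-dimensional embedding space."""
    def __init__(self, lower, upper):
        self.lower = np.asarray(lower, dtype=float)
        self.upper = np.asarray(upper, dtype=float)

    def volume(self):
        # Product of per-dimension side lengths, clipped at zero so
        # empty intersections get volume 0.
        return float(np.prod(np.clip(self.upper - self.lower, 0.0, None)))

    def intersect(self, other):
        return Box(np.maximum(self.lower, other.lower),
                   np.minimum(self.upper, other.upper))

def containment(concept_a, concept_b):
    """Fraction of A's box that lies inside B's box.
    A value near 1 suggests 'A is a kind of B'."""
    va = concept_a.volume()
    if va == 0.0:
        return 0.0
    return concept_a.intersect(concept_b).volume() / va

# Hypothetical 2-D boxes: "red cube" nests entirely inside "cube".
cube = Box([0.0, 0.0], [1.0, 1.0])
red_cube = Box([0.2, 0.2], [0.6, 0.6])
print(containment(red_cube, cube))  # -> 1.0 (red cube is a kind of cube)
print(containment(cube, red_cube))  # -> 0.16 (most cubes are not red)
```

In the full model this containment score would be made differentiable (e.g. with softened box edges) so that the novel concept's box can be optimized jointly against observed instances and the stated concept relations.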
Why do babies look longer when they see an object pass through a solid wall, or a person act inefficiently, during violation-of-expectation (VOE) studies? Here we test two non-mutually exclusive hypotheses: (i) VOE involves domain-general processes, like visual prediction error and curiosity about the source of surprise; (ii) VOE involves domain-specific processes, like prediction error over distinctively physical and psychological expectations (objects fall; agents behave rationally). In a pre-registered experiment, we scanned 32 adults using functional magnetic resonance imaging (fMRI) while they watched videos of agents and objects, adapted from infant behavioral research. Early visual regions responded equally to surprising and expected events in both domains, providing evidence against domain-general visual prediction error. Some multiple-demand regions, which are engaged when people deploy goal-directed attention, responded more to surprising events from both domains, providing evidence for domain-general endogenous attention. Domain-specific regions, which prefer stimuli involving agents vs. objects more broadly, showed similar preferences for the current videos of agents and objects. One region implicated in physical reasoning responded selectively to unexpected events from the physical domain, providing evidence for domain-specific physical prediction error. Thus, in adult brains, both domain-specific and high-level domain-general regions encode violations of expectation involving agents and objects, paving the way for future developmental work.