Understanding how goal states control behavior is a question ripe for interrogation by new methods from machine learning. These methods require large, labeled datasets to train models. To annotate a large-scale image dataset with observed search fixations, we collected 16,184 fixations from people searching for either microwaves or clocks in a dataset of 4,366 images (MS-COCO). We then used this behaviorally-annotated dataset and the machine learning method of Inverse-Reinforcement Learning (IRL) to learn target-specific reward functions and policies for these two target goals. Finally, we used these learned policies to predict the fixations of 60 new behavioral searchers (clock = 30, microwave = 30) in a disjoint test dataset of kitchen scenes depicting both a microwave and a clock (thus controlling for differences in low-level image contrast). We found that the IRL model predicted behavioral search efficiency and fixation-density maps, as evaluated using multiple metrics. Moreover, reward maps from the IRL model revealed target-specific patterns that suggest attention guidance not only by target features but also by scene context (e.g., fixations along walls when searching for clocks). Using machine learning and the psychologically-meaningful principle of reward, it is possible to learn the visual features used in goal-directed attention control.
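To make the fixation-density-map comparison concrete, the sketch below (an illustration, not the authors' code) builds a Gaussian-smoothed density map from (x, y) fixation coordinates and scores the agreement between a model map and a human map with a Pearson correlation, one common map-comparison metric; the image size, smoothing width, and example fixations are placeholder values.

```python
# Illustrative sketch: Gaussian-smoothed fixation-density maps and a
# Pearson-correlation score between two maps.  All numbers are made up.
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_density_map(fixations, height, width, sigma=25.0):
    """fixations: iterable of (x, y) pixel coordinates."""
    counts = np.zeros((height, width), dtype=float)
    for x, y in fixations:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            counts[yi, xi] += 1.0
    density = gaussian_filter(counts, sigma=sigma)
    total = density.sum()
    return density / total if total > 0 else density

def pearson_cc(map_a, map_b):
    a = (map_a - map_a.mean()) / (map_a.std() + 1e-12)
    b = (map_b - map_b.mean()) / (map_b.std() + 1e-12)
    return float((a * b).mean())

# Toy usage with invented fixations for a 320 x 512 image.
human = fixation_density_map([(100, 80), (105, 90), (300, 200)], 320, 512)
model = fixation_density_map([(110, 85), (290, 210)], 320, 512)
print("CC between model and human maps:", pearson_cc(human, model))
```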
Inverse-Reinforcement Learning

IRL is an imitation-learning method from the machine-learning literature that learns, through observations of an expert, a reward function and policy for mimicking expert performance. We extend this framework to goal-directed behavior by assuming that the image locations fixated by searchers constitute the expert performance that the model learns to mimic. The specific IRL algorithm that we use is Generative Adversarial Imitation Learning (GAIL) [10], which makes reward proportional to the model's ability to generate State-Action pairings that imitate observed State-Action pairings. Here, the Action is a shift of fixation location in a search image (the model's saccade), and the State is the search context (all the information available for use in the search task). The State includes, but is not limited to, the visual features extracted from an image and the learned visual
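To give a concrete sense of the GAIL idea, the toy sketch below (a strong simplification for illustration, not the implementation used in this work) discretizes an image into grid cells, treats the State as the current fixation cell and the Action as the next fixation cell, and alternates between training a logistic discriminator to separate human from model-generated State-Action pairs and updating a softmax policy with the discriminator-derived reward; the grid size, learning rates, and "expert" data are all invented for the example.

```python
# Toy GAIL-style loop: reward is high where model-generated State-Action
# pairs imitate the expert well enough to fool a discriminator.
import numpy as np

rng = np.random.default_rng(0)
N_CELLS = 16                                   # 4x4 grid of candidate fixation cells

# Invented "expert" data: these hypothetical searchers always saccade to cell 5.
expert_pairs = [(int(s), 5) for s in rng.integers(0, N_CELLS, size=500)]

policy_logits = np.zeros((N_CELLS, N_CELLS))   # softmax policy pi(Action | State)
disc_w = np.zeros((N_CELLS, N_CELLS))          # D(State, Action) = sigmoid(disc_w[s, a])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):
    # 1. Generate State-Action pairs from the current policy
    #    (states are sampled i.i.d. here; there are no real episode dynamics).
    gen_pairs = []
    for _ in range(100):
        s = int(rng.integers(0, N_CELLS))
        a = int(rng.choice(N_CELLS, p=softmax(policy_logits[s])))
        gen_pairs.append((s, a))

    # 2. Discriminator step: raise D on expert pairs, lower it on generated pairs.
    lr_d = 0.05
    for i in rng.integers(0, len(expert_pairs), size=100):
        s, a = expert_pairs[i]
        disc_w[s, a] += lr_d * (1.0 - sigmoid(disc_w[s, a]))
    for s, a in gen_pairs:
        disc_w[s, a] -= lr_d * sigmoid(disc_w[s, a])

    # 3. Policy step (REINFORCE): reward grows where the model's pairs
    #    look like the expert's to the discriminator.
    lr_p = 0.1
    for s, a in gen_pairs:
        reward = -np.log(1.0 - sigmoid(disc_w[s, a]) + 1e-8)
        grad = -softmax(policy_logits[s])
        grad[a] += 1.0                          # gradient of log pi(a | s) w.r.t. logits
        policy_logits[s] += lr_p * reward * grad

# After training, the policy should prefer the expert's saccade target (cell 5).
print("Most likely action from state 0:", int(np.argmax(policy_logits[0])))
```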