Balancing exploration and exploitation is a fundamental aspect of decision-making. It remains unknown whether people are close to optimal in striking this balance, and if not, how exactly their behavior deviates from optimality. Many existing paradigms are not ideally suited to answer this question, as they contain complexities such as non-stationary environments, stochasticity under exploitation, and reward distributions that are unknown to participants. Here, we introduce a task without such complexities, in which the optimal policy is to start off exploring and to switch to exploitation at most once in each sequence of decisions. The behavior of 49 laboratory and 143 online participants deviated both qualitatively and quantitatively from the optimal policy, even when allowing for bias and decision noise. Instead, people seem to follow a suboptimal rule in which they switch from exploration to exploitation when the highest reward so far exceeds a certain threshold. Moreover, we show that this threshold decreases approximately linearly with the proportion of the sequence that remains, suggesting a novel temporal ratio law. Finally, we find evidence for "sequence-level" variability, which is shared across all decisions in the same sequence. Our results provide a new perspective on the explore-exploit dilemma, and emphasize the importance of examining sequence-level strategies and their variability when studying sequential decision-making.
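The threshold rule described above can be illustrated with a minimal Python sketch. It is not the authors' model or fitting code, and several details are assumptions made only for illustration: rewards are drawn uniformly from [0, 1), exploiting returns the highest reward seen so far without noise, the remaining proportion counts the current decision, and the threshold is a linear function with hypothetical parameters `intercept` and `slope`. The function names `threshold` and `run_sequence` are likewise hypothetical.

```python
import random


def threshold(remaining_fraction, intercept, slope):
    """Hypothetical threshold, linear in the proportion of the sequence that remains."""
    return intercept + slope * remaining_fraction


def run_sequence(length, intercept, slope, rng):
    """Simulate one sequence of decisions under the threshold rule.

    Returns the list of choices ("explore" or "exploit") and the total reward.
    """
    best = None  # highest reward observed so far
    choices = []
    total = 0.0
    for t in range(length):
        # Assumed convention: the current decision counts as remaining.
        remaining_fraction = (length - t) / length
        exploit = best is not None and best > threshold(
            remaining_fraction, intercept, slope
        )
        if exploit:
            # Assumed: exploiting yields the best reward so far, deterministically.
            reward = best
            choices.append("exploit")
        else:
            # Assumed: exploring draws a new reward from a known uniform distribution.
            reward = rng.random()
            best = reward if best is None else max(best, reward)
            choices.append("explore")
        total += reward
    return choices, total


if __name__ == "__main__":
    rng = random.Random(0)
    choices, total = run_sequence(length=10, intercept=0.5, slope=0.4, rng=rng)
    print(choices, round(total, 3))
```

In this sketch the example parameters make the threshold fall as the sequence progresses, so the simulated agent switches to exploitation at most once. The sign and size of the slope are illustrative choices, not values taken from the study.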