This study presents a rigorous framework for investigating molecular
out-of-distribution (MOOD) generalization in drug discovery. We first clarify
the concept of MOOD through a problem specification that shows how the
covariate shifts encountered during real-world deployment can be
characterized by the distribution of sample distances
to the training set. We find that these shifts can reduce performance
by up to 60% and degrade uncertainty calibration by up to 40%. This
leads us to propose a splitting protocol that aims to close the gap
between deployment and testing. Using this protocol, we then conduct a
thorough investigation to assess the impact of model
design, model selection, and data set characteristics on MOOD performance
and uncertainty calibration. We find that appropriate representations
and algorithms with built-in uncertainty estimation are crucial to
improving performance and uncertainty calibration. This study sets
itself apart by its exhaustiveness and opens an exciting avenue to
benchmark meaningful algorithmic progress in molecular scoring.
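
As a rough illustration of the idea of characterizing a shift by sample distances to the training set, the sketch below computes, for each query molecule, its distance to the nearest training molecule. It is a minimal sketch under stated assumptions, not the protocol used in this study: the binary fingerprint representation, the Tanimoto (Jaccard) distance, and the function names `tanimoto_distance` and `distance_to_train` are all hypothetical choices made for the example.

```python
import numpy as np


def tanimoto_distance(a, b):
    """Pairwise Tanimoto (Jaccard) distance between rows of two binary matrices.

    Assumption: molecules are encoded as binary fingerprints; the study may
    use other representations and distance metrics.
    """
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    inter = (a[:, None, :] & b[None, :, :]).sum(axis=-1)
    union = (a[:, None, :] | b[None, :, :]).sum(axis=-1)
    # Two all-zero fingerprints are treated as identical (similarity 1).
    sim = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    return 1.0 - sim


def distance_to_train(train_fp, query_fp):
    """Distance from each query fingerprint to its nearest training fingerprint."""
    return tanimoto_distance(query_fp, train_fp).min(axis=1)


# Toy fingerprints (hypothetical data, 4 bits each).
train = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1]])
query = np.array([[1, 1, 0, 0],   # identical to a training sample
                  [1, 0, 0, 0],   # partially overlapping
                  [0, 1, 1, 0]])  # equally far from both training samples

d = distance_to_train(train, query)
print(d)
```

Comparing the histogram of such distances for a held-out test split against the one for the molecules expected at deployment gives one way to judge how representative a split is; a larger share of high-distance samples would indicate a stronger covariate shift.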