Combining quantum chemistry characterizations with generative machine
learning models has the potential to accelerate molecular discovery.
In this paradigm, quantum chemistry acts as a relatively cost-effective
oracle for evaluating the properties of particular molecules, while
generative models provide a means of sampling chemical space based
on learned structure–function relationships. For practical
applications, multiple potentially orthogonal properties must be optimized
in tandem during a discovery workflow. This carries additional difficulties
associated with the specificity of the targets and the ability of
the model to reconcile all properties simultaneously. Here, we demonstrate
an active learning approach to improve the performance of multi-target
generative chemical models. We first demonstrate the effectiveness
of a set of baseline models trained on single-property prediction
tasks in generating novel compounds (i.e., not present in the training
data) with various property targets, including both interpolative
and extrapolative generation scenarios. For property ranges where
accurate targeting proves difficult, the novel compounds suggested
by the model are characterized using quantum chemistry, and the new
molecules closest to expressing the desired properties are fed back
into the generative model for additional training. This gradually
improves the generative models’ understanding of targeted areas
of chemical space and shifts the distribution of the generated compounds
toward the targeted values. We then demonstrate the effectiveness
of this active learning approach in generating compounds with multiple
chemical constraints, including vertical ionization potential, electron
affinity, and dipole moment targets, and validate the results at the
ωB97X-D3/def2-TZVP level. This method requires no modifications
to extant generative approaches, but rather utilizes their inherent
generative and predictive aspects for self-refinement, and can be
applied to situations where any number of properties with varying
degrees of correlation must be optimized simultaneously.
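
The generate, characterize, select, and retrain cycle described above can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not the implementation used in this work: `ToyGenerator` is a hypothetical stand-in for a generative model (it samples points from a Gaussian instead of molecules), `toy_oracle` is a placeholder for the quantum chemistry calculation, and the three unitless coordinates loosely play the role of three property targets. In a real workflow the properties would need to be scaled before distances are compared.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


class ToyGenerator:
    """Hypothetical stand-in for a generative model (samples points, not molecules)."""

    def __init__(self, center: Sequence[float], spread: float = 1.0) -> None:
        self.center = list(center)
        self.spread = spread

    def sample(self, n: int, rng: random.Random) -> List[Vector]:
        # Draw n candidates around the current center of the learned distribution.
        return [tuple(rng.gauss(c, self.spread) for c in self.center) for _ in range(n)]

    def fine_tune(self, examples: List[Vector], rate: float = 0.5) -> None:
        # "Additional training": move the center part of the way toward the new examples.
        for i in range(len(self.center)):
            mean_i = sum(x[i] for x in examples) / len(examples)
            self.center[i] += rate * (mean_i - self.center[i])


def toy_oracle(candidate: Vector) -> Vector:
    """Placeholder for the quantum chemistry step: returns the candidate's properties."""
    return candidate


def distance(props: Sequence[float], target: Sequence[float]) -> float:
    # Euclidean distance between computed properties and the target values.
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(props, target)))


def active_learning(model: ToyGenerator, target: Vector, rounds: int = 10,
                    n_samples: int = 200, n_keep: int = 20, seed: int = 0) -> List[float]:
    """Run the loop and return the mean target error of each round's samples."""
    rng = random.Random(seed)
    history = []
    for _ in range(rounds):
        candidates = model.sample(n_samples, rng)
        # Characterize every candidate with the oracle and rank by closeness to target.
        scored = sorted(((distance(toy_oracle(c), target), c) for c in candidates),
                        key=lambda pair: pair[0])
        history.append(sum(d for d, _ in scored) / len(scored))
        # Feed the candidates closest to the target back into the model.
        model.fine_tune([c for _, c in scored[:n_keep]])
    return history


if __name__ == "__main__":
    model = ToyGenerator(center=(0.0, 0.0, 0.0))
    errors = active_learning(model, target=(3.0, 3.0, 3.0))
    print("mean target error, first vs. last round:", errors[0], errors[-1])
```

In this toy setting the model itself is untouched apart from the extra training step, which mirrors the point that the loop wraps around an existing generative approach. The fine-tuning rate and the number of candidates kept per round are arbitrary choices here.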