We report a method
to convert discrete representations of molecules
to and from a multidimensional continuous representation. This model
allows us to generate new molecules for efficient exploration and
optimization through open-ended spaces of chemical compounds. A deep
neural network was trained on hundreds of thousands of existing chemical
structures to construct three coupled functions: an encoder, a decoder,
and a predictor. The encoder converts the discrete representation
of a molecule into a real-valued continuous vector, and the decoder
converts these continuous vectors back to discrete molecular representations.
The predictor estimates chemical properties from the latent continuous
vector representation of the molecule. Continuous representations
of molecules allow us to automatically generate novel chemical structures
by performing simple operations in the latent space, such as decoding
random vectors, perturbing known chemical structures, or interpolating
between molecules. Continuous representations also allow the use of
powerful gradient-based optimization to efficiently guide the search
for optimized functional compounds. We demonstrate our method in the
domain of drug-like molecules and also in a set of molecules with
fewer than nine heavy atoms.
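The latent-space operations described above (decoding random vectors, perturbing known structures, interpolating between molecules) reduce to simple vector arithmetic once molecules are encoded. A minimal sketch follows; the 2-D latent vectors here are hypothetical stand-ins for the output of the paper's trained encoder, and a real decoder would map each point back to a discrete molecular representation:

```python
import numpy as np

def perturb(z, scale=0.1, rng=None):
    """Add Gaussian noise to a latent vector — a stand-in for
    'perturbing known chemical structures' in latent space."""
    rng = rng or np.random.default_rng(0)
    return z + scale * rng.standard_normal(z.shape)

def interpolate(z_a, z_b, n_steps=5):
    """Linearly interpolate between two latent vectors, producing
    candidate points that a decoder could map to intermediate
    molecular structures."""
    alphas = np.linspace(0.0, 1.0, n_steps)
    return [(1.0 - a) * z_a + a * z_b for a in alphas]

# Toy latent points standing in for two encoded molecules.
z1 = np.array([0.0, 0.0])
z2 = np.array([1.0, 2.0])
path = interpolate(z1, z2, n_steps=5)
z_new = perturb(z1)
```

Each point along `path` (and the perturbed point `z_new`) would then be passed through the decoder to propose a novel structure.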
Automatic Chemical Design is a framework for generating novel molecules with optimized properties. The original scheme, featuring Bayesian optimization over the latent space of a variational autoencoder, suffers from the pathology that it tends to produce invalid molecular structures. First, we demonstrate empirically that this pathology arises when the Bayesian optimization scheme queries latent points far away from the data on which the variational autoencoder has been trained. Second, by reformulating the search procedure as a constrained Bayesian optimization problem, we show that the effects of this pathology can be mitigated, yielding marked improvements in the validity of the generated molecules. We posit that constrained Bayesian optimization is a good approach for addressing this class of training set mismatch in many generative tasks involving Bayesian optimization over the latent space of a variational autoencoder.
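The core idea — rejecting latent points that lie far from the training data before the objective is ever queried — can be sketched with a distance-to-nearest-neighbor constraint. This is a simplified proxy for the paper's probabilistic validity constraint, and random search stands in for the Gaussian-process-based Bayesian optimization for brevity; the variable names and threshold are illustrative:

```python
import numpy as np

def near_training_data(z, z_train, radius=1.0):
    """Constraint proxy: accept a candidate latent point only if it
    lies within `radius` of at least one training point."""
    dists = np.linalg.norm(z_train - z, axis=1)
    return dists.min() <= radius

def constrained_search(objective, z_train, n_candidates=200,
                       radius=1.0, rng=None):
    """Maximize `objective` over the latent space while discarding
    candidates that violate the constraint (random search replaces
    Bayesian optimization in this sketch)."""
    rng = rng or np.random.default_rng(0)
    best_z, best_val = None, -np.inf
    for _ in range(n_candidates):
        z = rng.uniform(-3.0, 3.0, size=z_train.shape[1])
        if not near_training_data(z, z_train, radius):
            continue  # far from the data manifold: likely decodes invalidly
        val = objective(z)
        if val > best_val:
            best_z, best_val = z, val
    return best_z, best_val

# Toy setup: training latents cluster near the origin, while the
# objective peaks at (2, 2), outside the trusted region, so the
# constraint keeps the search from chasing the unconstrained optimum.
z_train = np.random.default_rng(1).normal(0.0, 0.5, size=(50, 2))
objective = lambda z: -np.sum((z - 2.0) ** 2)
z_best, val_best = constrained_search(objective, z_train)
```

The returned `z_best` satisfies the constraint by construction, mirroring how the constrained formulation confines queries to regions where the decoder is likely to produce valid molecules.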