Simultaneously
accurate and efficient prediction of molecular properties
throughout chemical compound space is a critical ingredient toward
rational compound design in chemical and pharmaceutical industries.
Aiming toward this goal, we develop and apply a systematic hierarchy
of efficient empirical methods to estimate atomization and total energies
of molecules. These methods range from a simple sum over atoms, to
addition of bond energies, to pairwise interatomic force fields, reaching
to the more sophisticated machine learning approaches that are capable
of describing collective interactions between many atoms or bonds.
In the case of equilibrium molecular geometries, even simple pairwise
force fields demonstrate prediction accuracy comparable to benchmark
energies calculated using density functional theory with hybrid exchange-correlation
functionals; however, accounting for the collective many-body interactions
proves to be essential for approaching the “holy grail”
of chemical accuracy of 1 kcal/mol for both equilibrium and out-of-equilibrium
geometries. This remarkable accuracy is achieved by a vectorized representation
of molecules (the so-called Bag of Bonds model) that exhibits strong nonlocality
in chemical space. In addition, the same representation allows us
to predict accurate electronic properties of molecules, such as their
polarizability and molecular frontier orbital energies.
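
As an illustration of what a vectorized "bag" representation might look like, the following is a minimal sketch and not the authors' implementation. It rests on assumptions the abstract does not state: each bag collects Coulomb-like terms Z_i Z_j / r_ij for one pair of element types, the entries in a bag are sorted in descending order, and each bag is zero-padded to a fixed length. The function name `bag_of_bonds` and the `bag_sizes` argument are hypothetical.

```python
import math
from collections import defaultdict


def bag_of_bonds(charges, coords, bag_sizes):
    """Hypothetical helper: build a fixed-length Bag-of-Bonds-style vector.

    charges   : list of nuclear charges Z_i
    coords    : list of (x, y, z) atomic positions
    bag_sizes : dict mapping a sorted charge pair (Z_a, Z_b) to the bag length
    """
    bags = defaultdict(list)
    n = len(charges)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            key = tuple(sorted((charges[i], charges[j])))
            # Assumed bag entry: Coulomb-like pair term Z_i * Z_j / r_ij.
            bags[key].append(charges[i] * charges[j] / r)

    # Refuse pair types or bag lengths the caller did not plan for.
    for key, entries in bags.items():
        if key not in bag_sizes:
            raise ValueError(f"no bag size given for pair type {key}")
        if len(entries) > bag_sizes[key]:
            raise ValueError(f"bag {key} is too small for this molecule")

    vector = []
    for key in sorted(bag_sizes):
        # Sorting inside each bag removes the dependence on atom ordering.
        entries = sorted(bags.get(key, []), reverse=True)
        padding = [0.0] * (bag_sizes[key] - len(entries))
        vector.extend(entries + padding)
    return vector


# Usage: a rough water geometry (angstrom); bag sizes chosen arbitrarily.
water_charges = [8, 1, 1]
water_coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
sizes = {(1, 1): 3, (1, 8): 4}
water_vector = bag_of_bonds(water_charges, water_coords, sizes)
```

Because every molecule maps to a vector of the same length, such vectors can be compared directly by a distance-based learning method, and reordering the atoms of the input leaves the vector unchanged.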
The accurate and reliable prediction of properties of molecules typically requires computationally intensive quantum-chemical calculations. Recently, machine learning techniques applied to ab initio calculations have been proposed as an efficient approach for describing the energies of molecules in their given ground-state structure throughout chemical compound space (Rupp et al. Phys. Rev. Lett. 2012, 108, 058301). In this paper, we outline a number of established machine learning techniques and investigate the influence of the molecular representation on the methods' performance. The best methods achieve prediction errors of 3 kcal/mol for the atomization energies of a wide variety of molecules. Rationales for this performance improvement are given, together with pitfalls and challenges when applying machine learning approaches to the prediction of quantum-mechanical observables.
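
To make the notion of a molecular representation concrete, here is a minimal sketch of one well-known choice, the Coulomb matrix introduced in the cited Rupp et al. work, with diagonal entries 0.5 Z_i^2.4 and off-diagonal entries Z_i Z_j / |R_i - R_j|. The abstract does not say which representations are compared, so treating this matrix and the row-norm sorting as relevant here is an assumption. The function names `coulomb_matrix` and `sort_by_row_norm` are hypothetical.

```python
import math


def coulomb_matrix(charges, coords):
    """Hypothetical helper: Coulomb matrix for nuclear charges and positions."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: 0.5 * Z_i ** 2.4.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Coulomb repulsion between nuclei i and j.
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


def sort_by_row_norm(matrix):
    """Reorder rows and columns together by descending row norm.

    This is one possible way to reduce dependence on atom ordering.
    """
    def row_norm(i):
        return math.sqrt(sum(x * x for x in matrix[i]))

    order = sorted(range(len(matrix)), key=row_norm, reverse=True)
    return [[matrix[i][j] for j in order] for i in order]


# Usage: a rough water geometry (angstrom), with hydrogen listed first.
example_charges = [1, 8, 1]
example_coords = [(0.96, 0.0, 0.0), (0.0, 0.0, 0.0), (-0.24, 0.93, 0.0)]
sorted_matrix = sort_by_row_norm(coulomb_matrix(example_charges, example_coords))
```

After sorting, the oxygen row comes first because its entries are the largest, whichever position oxygen held in the input list.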