Simultaneously
accurate and efficient prediction of molecular properties
throughout chemical compound space is a critical ingredient toward
rational compound design in chemical and pharmaceutical industries.
Aiming toward this goal, we develop and apply a systematic hierarchy
of efficient empirical methods to estimate atomization and total energies
of molecules. These methods range from a simple sum over atoms, to
addition of bond energies, to pairwise interatomic force fields, reaching
to the more sophisticated machine learning approaches that are capable
of describing collective interactions between many atoms or bonds.
In the case of equilibrium molecular geometries, even simple pairwise
force fields demonstrate prediction accuracy comparable to benchmark
energies calculated using density functional theory with hybrid exchange-correlation
functionals; however, accounting for the collective many-body interactions
proves to be essential for approaching the “holy grail”
of chemical accuracy of 1 kcal/mol for both equilibrium and out-of-equilibrium
geometries. This remarkable accuracy is achieved by a vectorized representation
of molecules (the so-called Bag of Bonds model) that exhibits strong nonlocality
in chemical space. In addition, the same representation allows us
to predict accurate electronic properties of molecules, such as their
polarizability and molecular frontier orbital energies.
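
The lowest rung of the hierarchy described above, a simple sum over atoms, can be sketched as a least-squares fit of one energy contribution per element. This is only an illustrative sketch, not the authors' implementation: the element list `ELEMENTS` and the function names `composition_vector`, `fit_atomic_contributions`, and `predict_energy` are assumptions introduced here, and the reference energies must be supplied by the user.

```python
import numpy as np

# Assumed element set for illustration; extend as needed for other data.
ELEMENTS = ["H", "C", "N", "O"]


def composition_vector(counts):
    """Map a dict such as {"C": 1, "H": 4} to atom counts ordered as ELEMENTS."""
    return np.array([counts.get(el, 0) for el in ELEMENTS], dtype=float)


def fit_atomic_contributions(compositions, energies):
    """Least-squares fit of one energy contribution per element.

    Model: E(molecule) ~ sum over elements of n_element * eps_element.
    """
    X = np.vstack([composition_vector(c) for c in compositions])
    y = np.asarray(energies, dtype=float)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs


def predict_energy(coeffs, counts):
    """Predict the energy of a molecule from its atom counts alone."""
    return float(composition_vector(counts) @ coeffs)
```

Given a list of atom-count dictionaries and matching reference energies, `fit_atomic_contributions` returns the fitted per-element values, which `predict_energy` then applies to a new composition. Because the model sees only atom counts, it cannot distinguish isomers, which is one reason the hierarchy moves on to bond-based, pairwise, and many-body models.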
Machine learning (ML) based prediction of molecular properties across chemical compound space is an important alternative approach for efficiently estimating the solutions of highly complex many-electron problems in chemistry and physics. Statistical methods represent molecules as descriptors that should encode molecular symmetries and interactions between atoms. Many such descriptors have been proposed; all of them have advantages and limitations. Here, we propose a set of general two-body and three-body interaction descriptors that are invariant to translation, rotation, and atomic indexing. By adapting established kernel ridge regression methods from machine learning, we evaluate our descriptors on predicting several properties of small organic molecules calculated using density-functional theory. We use two data sets. The GDB-7 set contains 6868 molecules with up to 7 heavy atoms of type C, N, and O. The GDB-9 set is composed of 131722 molecules with up to 9 heavy atoms of type C, N, and O. When trained on 5000 random molecules, our best model achieves an accuracy of 0.8 kcal/mol on the remaining 1868 molecules of GDB-7 and 1.5 kcal/mol on the remaining 126722 molecules of GDB-9. A linear regression model applied to our novel many-body descriptors performs almost as well as a nonlinear kernelized model. Linear models are readily interpretable: a feature importance ranking measure helps to obtain qualitative and quantitative insights into the importance of two- and three-body molecular interactions for predicting molecular properties computed with quantum-mechanical methods.
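
To make the invariance requirement and the regression step concrete, the following is a minimal sketch under stated assumptions. The descriptor here is a hypothetical stand-in (a sum of inverse interatomic distances per element pair), chosen only because it is invariant to translation, rotation, and atomic indexing; it is not the descriptor proposed in the abstract. The kernel ridge regression uses a Gaussian kernel with illustrative hyperparameters `sigma` and `lam`, and all function names are assumptions.

```python
import itertools
import numpy as np

# Assumed element set; the pair list fixes the order of descriptor entries.
ELEMENTS = ("H", "C", "N", "O")
PAIRS = list(itertools.combinations_with_replacement(ELEMENTS, 2))


def two_body_descriptor(symbols, coords):
    """Hypothetical two-body descriptor: sum of 1/r per unordered element pair.

    It depends only on interatomic distances (translation and rotation
    invariant) and sums over all atom pairs (independent of atom ordering).
    """
    coords = np.asarray(coords, dtype=float)
    feats = dict.fromkeys(PAIRS, 0.0)
    for i, j in itertools.combinations(range(len(symbols)), 2):
        key = tuple(sorted((symbols[i], symbols[j]), key=ELEMENTS.index))
        feats[key] += 1.0 / np.linalg.norm(coords[i] - coords[j])
    return np.array([feats[p] for p in PAIRS])


def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def krr_fit(X, y, sigma=1.0, lam=1e-8):
    """Solve (K + lam * I) alpha = y for the regression coefficients alpha."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), np.asarray(y, dtype=float))


def krr_predict(X_train, alpha, X_new, sigma=1.0):
    """Predict properties for new descriptors from the fitted coefficients."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha
```

In practice, each molecule would be converted with `two_body_descriptor`, the resulting rows stacked into `X`, and `sigma` and `lam` selected by cross-validation rather than left at the defaults shown here.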
Machine learning has been successfully applied to the prediction of chemical properties of small organic molecules, such as energies or polarizabilities. Compared to these properties, electronic excitation energies pose a much more challenging learning problem. Here, we examine the applicability of two existing machine learning methodologies to the prediction of excitation energies from time-dependent density functional theory. To this end, we systematically study the performance of various 2- and 3-body descriptors as well as the deep neural network SchNet in predicting extensive as well as intensive properties, such as the transition energies from the ground state to the first and second excited states. As perhaps expected, current state-of-the-art machine learning techniques are better suited to predicting extensive quantities than intensive ones. We speculate on the need to develop global descriptors that can describe both extensive and intensive properties on an equal footing.