Computational de novo design of new drugs and materials requires rigorous and unbiased exploration of chemical compound space. However, large uncharted territories persist due to its size scaling combinatorially with molecular size. We report computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules. We report geometries minimal in energy, corresponding harmonic frequencies, dipole moments, polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. Furthermore, for the predominant stoichiometry, C7H10O2, there are 6,095 constitutional isomers among the 134k molecules. We report energies, enthalpies, and free energies of atomization at the more accurate G4MP2 level of theory for all of them. As such, this data set provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.
Chemically accurate and comprehensive studies of the virtual space of all possible molecules are severely limited by the computational cost of quantum chemistry. We introduce a composite strategy that adds machine learning corrections to computationally inexpensive approximate legacy quantum methods. After training, highly accurate predictions of enthalpies, free energies, entropies, and electron correlation energies are possible, for significantly larger molecular sets than used for training. For thermochemical properties of up to 16k isomers of C7H10O2 we present numerical evidence that chemical accuracy can be reached. We also predict electron correlation energy in post Hartree-Fock methods, at the computational cost of Hartree-Fock, and we establish a qualitative relationship between molecular entropy and electron correlation. The transferability of our approach is demonstrated, using semiempirical quantum chemistry and machine learning models trained on 1 and 10% of 134k organic molecules, to reproduce enthalpies of all remaining molecules at density functional theory level of accuracy.
As the quantum chemistry (QC) community embraces machine learning (ML), the number of new methods and applications based on the combination of QC and ML is surging. In this Perspective, a view of the current state of affairs in this new and exciting research field is offered, challenges of using machine learning in quantum chemistry applications are described, and potential future developments are outlined. Specifically, examples of how machine learning is used to improve the accuracy and accelerate quantum chemical research are shown. Generalization and classification of existing techniques are provided to ease the navigation in the sea of literature and to guide researchers entering the field. The emphasis of this Perspective is on supervised machine learning.
We present an efficient approach for generating highly accurate molecular potential energy surfaces (PESs) using self-correcting, kernel ridge regression (KRR) based machine learning (ML). We introduce structure-based sampling to automatically assign nuclear configurations from a pre-defined grid to the training and prediction sets, respectively. Accurate high-level ab initio energies are required only for the points in the training set, while the energies for the remaining points are provided by the ML model with negligible computational cost. The proposed sampling procedure is shown to be superior to random sampling and also eliminates the need for training several ML models. Self-correcting machine learning has been implemented such that each additional layer corrects errors from the previous layer. The performance of our approach is demonstrated in a case study on a published high-level ab initio PES of methyl chloride with 44 819 points. The ML model is trained on sets of different sizes and then used to predict the energies for tens of thousands of nuclear configurations within seconds. The resulting datasets are utilized in variational calculations of the vibrational energy levels of CHCl. By using both structure-based sampling and self-correction, the size of the training set can be kept small (e.g., 10% of the points) without any significant loss of accuracy. In ab initio rovibrational spectroscopy, it is thus possible to reduce the number of computationally costly electronic structure calculations through structure-based sampling and self-correcting KRR-based machine learning by up to 90%.
We show that machine learning (ML) can be used to accurately reproduce nonadiabatic excited-state dynamics with decoherence-corrected fewest switches surface hopping in a 1-D model system. We propose to use ML to significantly reduce the simulation time of realistic, high-dimensional systems with good reproduction of observables obtained from reference simulations. Our approach is based on creating approximate ML potentials for each adiabatic state using a small number of training points. We investigate the feasibility of this approach by using adiabatic spin-boson Hamiltonian models of various dimensions as reference methods.
Semiempirical orthogonalization-corrected methods (OM1, OM2, and OM3) go beyond the standard MNDO model by explicitly including additional interactions into the Fock matrix in an approximate manner (Pauli repulsion, penetration effects, and core–valence interactions), which yields systematic improvements both for ground-state and excited-state properties. In this Article, we describe the underlying theoretical formalism of the OMx methods and their implementation in full detail, and we report all relevant OMx parameters for hydrogen, carbon, nitrogen, oxygen, and fluorine. For a standard set of mostly organic molecules commonly used in semiempirical method development, the OMx results are found to be superior to those from standard MNDO-type methods. Parametrized Grimme-type dispersion corrections can be added to OM2 and OM3 energies to provide a realistic treatment of noncovalent interaction energies, as demonstrated for the complexes in the S22 and S66×8 test sets.
In this work we show that deep learning (DL) can be used for exploring complex and highly nonlinear multistate potential energy surfaces of polyatomic molecules and related nonadiabatic dynamics. Our DL is based on deep neural networks (DNNs), which are used as accurate representations of the CASSCF ground-and excited-state potential energy surfaces (PESs) of CH 2 NH. After geometries near conical intersection are included in the training set, the DNN models accurately reproduce excited-state topological structures; photoisomerization paths; and, importantly, conical intersections. We have also demonstrated that the results from nonadiabatic dynamics run with the DNN models are very close to those from the dynamics run with the pure ab initio method. The present work should encourage further studies of using machine learning methods to explore excited-state potential energy surfaces and nonadiabatic dynamics of polyatomic molecules.
The semiempirical orthogonalization-corrected OMx methods (OM1, OM2, and OM3) go beyond the standard MNDO model by including additional interactions in the electronic structure calculation. When augmented with empirical dispersion corrections, the resulting OMx-Dn approaches offer a fast and robust treatment of noncovalent interactions. Here we evaluate the performance of the OMx and OMx-Dn methods for a variety of ground-state properties using a large and diverse collection of benchmark sets from the literature, with a total of 13035 original and derived reference data. Extensive comparisons are made with the results from established semiempirical methods (MNDO, AM1, PM3, PM6, and PM7) that also use the NDDO (neglect of diatomic differential overlap) integral approximation. Statistical evaluations show that the OMx and OMx-Dn methods outperform the other methods for most of the benchmark sets.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.