The training of molecular models of quantum mechanical properties based on statistical machine learning requires large datasets which exemplify the map from chemical structure to molecular property. Intelligent a priori selection of training examples is often difficult or impossible to achieve, as prior knowledge may be sparse or unavailable. Ordinarily, representative selection of training molecules from such datasets is achieved through random sampling. We use genetic algorithms for the optimization of training set composition consisting of tens of thousands of small organic molecules. The resulting machine learning models are considerably more accurate than those trained on small randomly selected training sets: mean absolute errors for out-of-sample predictions are reduced to ∼25% for enthalpies, free energies, and zero-point vibrational energy, to ∼50% for heat capacity, electron spread, and polarizability, and by more than ∼20% for electronic properties such as frontier orbital eigenvalues or dipole moments. We discuss and present optimized training sets consisting of 10 molecular classes for all molecular properties studied. We show that these classes can be used to design improved training sets for the generation of machine learning models of the same properties in similar but unrelated molecular sets.
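The idea of evolving the composition of a training set can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' setup: the toy 1-D `target` function stands in for a molecular property, a 1-nearest-neighbour predictor stands in for the machine learning model, and the function names, population size, and mutation rate are hypothetical.

```python
import math
import random

def target(x):
    # Toy "property": a smooth 1-D function standing in for a molecular property.
    return math.sin(3.0 * x) + 0.5 * x

def predict_1nn(train_idx, xs, ys, x):
    # Stand-in learner (assumption): return the label of the nearest training point.
    nearest = min(train_idx, key=lambda i: abs(xs[i] - x))
    return ys[nearest]

def mae(train_idx, xs, ys, eval_idx):
    # Mean absolute error of the stand-in learner on the evaluation indices.
    errs = [abs(predict_1nn(train_idx, xs, ys, xs[j]) - ys[j]) for j in eval_idx]
    return sum(errs) / len(errs)

def crossover(a, b, size, rng):
    # The child draws its members from the union of both parents.
    pool = sorted(set(a) | set(b))
    return rng.sample(pool, size)

def mutate(subset, candidates, rate, rng):
    # Swap each member for a currently unused candidate with probability `rate`.
    child = list(subset)
    for k in range(len(child)):
        if rng.random() < rate:
            unused = [c for c in candidates if c not in child]
            child[k] = rng.choice(unused)
    return child

def optimize_training_set(xs, ys, candidates, eval_idx, size=8,
                          pop_size=20, generations=30, rate=0.1, seed=0):
    rng = random.Random(seed)
    population = [rng.sample(candidates, size) for _ in range(pop_size)]
    history = []  # best error per generation
    for _ in range(generations):
        population.sort(key=lambda s: mae(s, xs, ys, eval_idx))
        history.append(mae(population[0], xs, ys, eval_idx))
        elite = population[: pop_size // 4]  # elitism: best subsets survive unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            children.append(mutate(crossover(a, b, size, rng), candidates, rate, rng))
        population = elite + children
    population.sort(key=lambda s: mae(s, xs, ys, eval_idx))
    history.append(mae(population[0], xs, ys, eval_idx))
    return population[0], history

data_rng = random.Random(42)
xs = [data_rng.uniform(0.0, 3.0) for _ in range(200)]
ys = [target(x) for x in xs]
candidates = list(range(100))     # pool the training set is drawn from
eval_idx = list(range(100, 200))  # points used to score each candidate subset
best, history = optimize_training_set(xs, ys, candidates, eval_idx)
print(f"best of initial random subsets, MAE: {history[0]:.3f}")
print(f"genetically optimized subset, MAE:  {history[-1]:.3f}")
```

Because the best subsets are carried over unchanged, the best error cannot increase from one generation to the next. Note that the error reported here is measured on the same points that drive the selection; a genuine out-of-sample estimate would need a further held-out set.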
Free energies govern the behavior of soft and liquid matter, and improving their predictions could have a large impact on the development of drugs, electrolytes, or homogeneous catalysts. Unfortunately, it is challenging to devise an accurate description of effects governing solvation such as hydrogen bonding, van der Waals interactions, or conformational sampling. We present a Free energy Machine Learning (FML) model applicable throughout chemical compound space and based on a representation that employs Boltzmann averages to account for an approximated sampling of configurational space. Using the FreeSolv database, FML's out-of-sample prediction errors of experimental hydration free energies decay systematically with training set size, and experimental uncertainty (0.6 kcal/mol) is reached after training on 490 molecules (80% of FreeSolv). Corresponding FML model errors are on par with state-of-the-art physics-based approaches. To generate the input representation for a new query compound, FML requires approximate and short molecular dynamics runs. We showcase its usefulness through analysis of solvation free energies for 116k organic molecules (all force-field compatible molecules in the QM9 database), identifying the most and least solvated systems and rediscovering quasi-linear structure-property relationships in terms of simple descriptors such as hydrogen-bond donors, number of NH or OH groups, number of oxygen atoms in hydrocarbons, and number of heavy atoms. FML's accuracy is maximal when the temperature used for the molecular dynamics simulation to generate averaged input representation samples in training is the same as for the query compounds. The sampling time for the representation converges rapidly with respect to the prediction error.
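A Boltzmann average of a per-configuration representation can be sketched as follows. This is a generic illustration and not the FML implementation: the function names, the flat-list representation vectors, and the use of explicit energies are assumptions. When configurations come from a molecular dynamics run at the temperature of interest, the frames are already sampled with Boltzmann probability, so the plain mean over frames plays the role of the weighted sum shown here.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def boltzmann_weights(energies, temperature):
    # w_i = exp(-E_i / kT) / sum_j exp(-E_j / kT), energies in kcal/mol.
    beta = 1.0 / (K_B * temperature)
    e_min = min(energies)  # shift by the minimum for numerical stability
    factors = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(factors)
    return [f / z for f in factors]

def boltzmann_average(vectors, energies, temperature):
    # Weighted average of per-configuration representation vectors.
    weights = boltzmann_weights(energies, temperature)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

# Hypothetical example: three conformers with made-up 2-D representations.
conformer_vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
conformer_energies = [0.0, 0.5, 2.0]  # kcal/mol, illustrative values only
averaged = boltzmann_average(conformer_vectors, conformer_energies, 298.15)
print(averaged)
```

The resulting single vector per molecule can then be passed to any regression model; low-energy configurations dominate the average, and raising the temperature flattens the weights.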
Due to their very nature, ultrafast phenomena are often accompanied by the occurrence of nonadiabatic effects. From a theoretical perspective, the treatment of nonadiabatic processes makes it necessary to go beyond the (quasi) static picture provided by the time-independent Schrödinger equation within the Born-Oppenheimer approximation and to find ways to tackle instead the full time-dependent electronic and nuclear quantum problem. In this review, we give an overview of different nonadiabatic processes that manifest themselves in electronic and nuclear dynamics ranging from the nonadiabatic phenomena taking place during tunnel ionization of atoms in strong laser fields to the radiationless relaxation through conical intersections and the nonadiabatic coupling of vibrational modes and discuss the computational approaches that have been developed to describe such phenomena. These methods range from the full solution of the combined nuclear-electronic quantum problem to a hierarchy of semiclassical approaches and even purely classical frameworks. The power of these simulation tools is illustrated by representative applications and the direct confrontation with experimental measurements performed in the National Centre of Competence for Molecular Ultrafast Science and Technology.
The development of thermostable and solvent-tolerant metalloproteins is a long-sought goal for many applications in synthetic biology and biotechnology. In this work, we engineered a highly thermostable and organic solvent-stable metallo variant of the B1 domain of protein G (GB1) with a tetrahedral zinc binding site reminiscent of that of thermolysin. Promising candidates were designed computationally by applying a protocol based on classical and first-principles molecular dynamics simulations in combination with genetic algorithm optimization. The most promising of the computationally predicted mutants was expressed and structurally characterized and yielded a highly thermostable protein. The experimental results thus confirm the predictive power of the applied computational protein engineering approach for the de novo design of highly stable metalloproteins.
Conventional kernel-based machine learning models for ab initio potential energy surfaces, while accurate and convenient in small data regimes, suffer immense computational cost as training set sizes increase. We introduce QML-Lightning, a PyTorch package containing GPU-accelerated approximate kernel models, which reduces the training time by several orders of magnitude, yielding trained models within seconds. QML-Lightning includes a cost-efficient GPU implementation of FCHL19, which together can yield energy and force predictions with competitive accuracy on a microsecond-per-atom timescale. Using modern GPU hardware, we report learning curves of energies and forces as well as timings as numerical evidence for select legacy benchmarks from atomistic simulation including QM9, MD-17, and 3BPA.
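One common way to build an approximate kernel model is with random Fourier features, sketched below in plain Python. The abstract does not say which approximation is used, so this choice is an assumption, and the code does not use or reproduce the QML-Lightning API; `make_rff`, `featurize`, and all parameter values are hypothetical names for illustration. The point is that a Gaussian kernel value can be approximated by an ordinary dot product of finite feature vectors, so training reduces to linear regression on those features instead of solving a system whose size grows with the number of training points.

```python
import math
import random

def gaussian_kernel(x, y, sigma):
    # Exact Gaussian kernel: exp(-|x - y|^2 / (2 sigma^2)).
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def make_rff(dim, n_features, sigma, seed=0):
    # Draw random frequencies w ~ N(0, 1/sigma^2) and phases b ~ U(0, 2 pi).
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
               for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def featurize(x):
        # z(x)_k = sqrt(2 / D) * cos(w_k . x + b_k)
        return [scale * math.cos(sum(wk * xk for wk, xk in zip(w, x)) + b)
                for w, b in zip(weights, phases)]

    return featurize

sigma = 1.5
featurize = make_rff(dim=3, n_features=4000, sigma=sigma)
x = [0.2, -0.4, 1.0]
y = [0.5, 0.1, 0.3]
exact = gaussian_kernel(x, y, sigma)
approx = sum(a * b for a, b in zip(featurize(x), featurize(y)))
print(f"exact kernel:  {exact:.4f}")
print(f"approximation: {approx:.4f}")
```

The approximation error shrinks roughly as the inverse square root of the number of random features, which is the knob that trades accuracy against cost.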