The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of data generation to train such ML potentials is a task neither well understood nor researched in detail. In this work, we present a fully automated approach for the generation of datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement among the members of an ensemble of ML potentials to infer the reliability of the ensemble's prediction. QBC allows the presented AL algorithm to automatically sample regions of chemical space where the ML potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what new training data to use. AL also reduces the training set size to a fraction of the data required when using naive random sampling techniques. To provide validation of our AL approach, we develop the COmprehensive Machine-learning Potential (COMP6) benchmark (publicly available on GitHub), which contains a diverse set of organic molecules. Active learning-based ANI potentials outperform the original randomly sampled ANI-1 potential with only 10% of the data, while the final active learning-based model vastly outperforms ANI-1 on the COMP6 benchmark after training to only 25% of the data. Finally, we show that our proposed AL technique develops a universal ANI potential (ANI-1x) that provides accurate energy and force predictions on the entire COMP6 benchmark. This universal ML potential achieves a level of accuracy on par with the best ML potentials for single molecules or materials, while remaining applicable to the general class of organic molecules composed of the elements CHNO.
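The Query by Committee selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy ensemble, the threshold value, and the choice to scale the ensemble standard deviation by the square root of the atom count are all assumptions made for illustration.

```python
from math import sqrt
from statistics import pstdev


def qbc_disagreement(predictions, n_atoms):
    """Ensemble disagreement: population standard deviation of the predicted
    energies, scaled by sqrt(n_atoms) (the scaling is an assumption here)."""
    return pstdev(predictions) / sqrt(n_atoms)


def select_for_labeling(candidates, ensemble, threshold):
    """Return the candidates on which the ensemble disagrees by more than
    `threshold`; these would be sent for new reference calculations."""
    selected = []
    for molecule in candidates:
        predictions = [model(molecule) for model in ensemble]
        if qbc_disagreement(predictions, molecule["n_atoms"]) > threshold:
            selected.append(molecule)
    return selected


# Hypothetical stand-ins for trained potentials: each maps a molecule to an energy.
ensemble = [
    lambda m: m["x"] * 1.0,
    lambda m: m["x"] * 1.1,
    lambda m: m["x"] * 0.9,
]

# Hypothetical candidates; "x" is a toy feature that drives the toy models.
candidates = [
    {"name": "a", "x": 0.1, "n_atoms": 4},
    {"name": "b", "x": 10.0, "n_atoms": 4},
]

selected_names = [m["name"] for m in select_for_labeling(candidates, ensemble, 0.1)]
```

With these toy models the ensemble agrees closely on molecule "a" and disagrees strongly on molecule "b", so only "b" is selected for labeling.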
Computational modeling of chemical and biological systems at atomic resolution is a crucial tool in the chemist’s toolset. The use of computer simulations requires a balance between cost and accuracy: quantum-mechanical methods provide high accuracy but are computationally expensive and scale poorly to large systems, while classical force fields are cheap and scalable, but lack transferability to new systems. Machine learning can be used to achieve the best of both approaches. Here we train a general-purpose neural network potential (ANI-1ccx) that approaches CCSD(T)/CBS accuracy on benchmarks for reaction thermochemistry, isomerization, and drug-like molecular torsions. This is achieved by training a network to DFT data then using transfer learning techniques to retrain on a dataset of gold standard QM calculations (CCSD(T)/CBS) that optimally spans chemical space. The resulting potential is broadly applicable to materials science, biology, and chemistry, and billions of times faster than CCSD(T)/CBS calculations.
Optically active molecular materials, such as organic conjugated polymers and biological systems, are characterized by strong coupling between electronic and vibrational degrees of freedom. Typically, simulations must go beyond the Born-Oppenheimer approximation to account for non-adiabatic coupling between excited states. Indeed, non-adiabatic dynamics is commonly associated with exciton dynamics and photophysics involving charge and energy transfer, as well as exciton dissociation and charge recombination. Understanding the photoinduced dynamics in such materials is vital to providing an accurate description of exciton formation, evolution, and decay. This interdisciplinary field has matured significantly over the past decades. Formulation of new theoretical frameworks, development of more efficient and accurate computational algorithms, and evolution of high-performance computer hardware have extended these simulations to very large molecular systems with hundreds of atoms, including numerous studies of organic semiconductors and biomolecules. In this Review, we will describe recent theoretical advances including treatment of electronic decoherence in surface-hopping methods, the role of solvent effects, trivial unavoided crossings, analysis of data based on transition densities, and efficient computational implementations of these numerical methods. We also emphasize newly developed semiclassical approaches, based on the Gaussian approximation, which retain phase and width information to account for significant decoherence and interference effects while maintaining the high efficiency of surface-hopping approaches. The above developments have been employed to successfully describe photophysics in a variety of molecular materials.
Myoglobin (Mb) double mutant T67R/S92D displays peroxidase enzymatic activity in contrast to the wild type protein. The CO adduct of T67R/S92D shows two CO absorption bands corresponding to the A1 and A3 substates. The equilibrium protein dynamics for the two distinct substates of the Mb double mutant are investigated by using two-dimensional infrared (2D IR) vibrational echo spectroscopy and molecular dynamics (MD) simulations. The time-dependent changes in the 2D IR vibrational echo line shapes for both substates are analyzed using the center line slope (CLS) method to obtain the frequency-frequency correlation function (FFCF). The results for the double mutant are compared to those from the wild type Mb. The experimentally determined FFCF is compared to the FFCF obtained from molecular dynamics simulations, thereby testing the capacity of a force field to determine the amplitudes and time scales of protein structural fluctuations on fast time scales. The results provide insights into the nature of the energy landscape around the free energy minimum of the folded protein structure.
Maximum diversification of data is a central theme in building generalized and accurate machine learning (ML) models. In chemistry, ML has been used to develop models for predicting molecular properties, for example quantum mechanics (QM) calculated potential energy surfaces and atomic charge models. The ANI-1x and ANI-1ccx ML-based general-purpose potentials for organic molecules were developed through active learning, an automated data diversification process. Here, we describe the ANI-1x and ANI-1ccx data sets. To demonstrate data diversity, we visualize it with a dimensionality reduction scheme and contrast it with existing data sets. The ANI-1x data set contains multiple QM properties from 5 M density functional theory calculations, while the ANI-1ccx data set contains 500 k data points obtained with an accurate CCSD(T)/CBS extrapolation. Approximately 14 million CPU core-hours were expended to generate this data. Multiple QM calculated properties for the chemical elements C, H, N, and O are provided: energies, atomic forces, multipole moments, atomic charges, etc. We provide this data to the community to aid research and development of ML models for chemistry.
We present a versatile new code released for open community use, the nonadiabatic excited state molecular dynamics (NEXMD) package. This software aims to simulate nonadiabatic excited state molecular dynamics using several semiempirical Hamiltonian models. To model such dynamics of a molecular system, NEXMD uses the fewest-switches surface hopping algorithm, where the probability of transition from one state to another depends on the strength of the derivative nonadiabatic coupling. In addition, there are a number of algorithmic improvements such as empirical decoherence corrections and tracking trivial crossings of electronic states. While the primary intent behind NEXMD was to simulate nonadiabatic molecular dynamics, the code can also perform geometry optimizations, adiabatic excited state dynamics, and single-point calculations, all in vacuum or in a simulated solvent. In this report, first, we lay out the basic theoretical framework underlying the code. Then we present the code’s structure and workflow. To demonstrate the functionality of NEXMD in detail, we analyze the photoexcited dynamics of a polyphenylene ethynylene dendrimer (PPE, C30H18) in vacuum and in a continuum solvent. Furthermore, the PPE molecule example serves to highlight the utility of the getexcited.py helper script to form a streamlined workflow. This script, provided with the package, can both set up NEXMD calculations and analyze the results, including, but not limited to, collecting populations, generating an average optical spectrum, and restarting unfinished calculations.
The ability to accurately and efficiently compute quantum-mechanical atomic partial charges has many practical applications, such as calculations of IR spectra, analysis of chemical bonding, and classical force field parametrization. Machine learning (ML) techniques provide a possible avenue for the efficient prediction of atomic partial charges. Modern ML advances in the prediction of molecular energies [i.e., the hierarchical interacting particle neural network (HIP-NN)] have provided the necessary model framework and architecture to predict transferable, extensible, and conformationally dynamic atomic partial charges based on reference density functional theory (DFT) simulations. Utilizing HIP-NN, we show that ML charge prediction can be highly accurate over a wide range of molecules (both small and large) across a variety of charge partitioning schemes such as the Hirshfeld, CM5, MSK, and NBO methods. To demonstrate transferability and size extensibility, we compare ML results with reference DFT calculations on the COMP6 benchmark, achieving errors of 0.004e (elementary charge). This is remarkable since this benchmark contains two proteins that are multiple times larger than the largest molecules in the training set. An application of our atomic charge predictions on nonequilibrium geometries is the generation of IR spectra from dynamical trajectories for a variety of organic molecules, which show good agreement with IR spectra calculated with the reference method. Critically, HIP-NN charge predictions are many orders of magnitude faster than direct DFT calculations. These combined results provide further evidence that ML (specifically HIP-NN) provides a pathway to greatly increase the range of feasible simulations while retaining quantum-level accuracy.
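The route from per-frame partial charges to an IR-like spectrum mentioned in this abstract can be sketched as follows. This is an illustrative simplification, not the workflow used in the paper: the function names, the fixed toy charges, and the two-atom trajectory are hypothetical, and the power spectrum of one dipole component is used as a stand-in for a properly normalized IR line shape.

```python
import cmath
import math


def dipole(charges, positions):
    """Point-charge dipole vector: sum over atoms of q_i * r_i."""
    return tuple(
        sum(q * r[axis] for q, r in zip(charges, positions)) for axis in range(3)
    )


def power_spectrum(signal):
    """Naive discrete Fourier power spectrum of a mean-centered real signal.
    Returns the power in frequency bins 0 .. n/2 - 1."""
    n = len(signal)
    average = sum(signal) / n
    centered = [s - average for s in signal]
    spectrum = []
    for k in range(n // 2):
        amplitude = sum(
            c * cmath.exp(-2j * math.pi * k * t / n) for t, c in enumerate(centered)
        )
        spectrum.append(abs(amplitude) ** 2)
    return spectrum


# Hypothetical two-atom trajectory: the bond length oscillates 5 times over 64 frames.
n_frames = 64
charges = [0.3, -0.3]  # toy charges; in practice an ML model would predict them per frame
trajectory = [
    [(0.0, 0.0, 0.0), (1.0 + 0.1 * math.cos(2 * math.pi * 5 * t / n_frames), 0.0, 0.0)]
    for t in range(n_frames)
]

dipole_x = [dipole(charges, frame)[0] for frame in trajectory]
spectrum = power_spectrum(dipole_x)
peak_bin = spectrum.index(max(spectrum))
```

In this toy case the spectrum peaks in the frequency bin that matches the imposed bond oscillation, which is the behavior one expects a vibrational mode with a changing dipole to show in an IR spectrum.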