Machine learning the electronic structure of open shell transition metal complexes presents unique challenges, including robust and automated data set generation. Here, we introduce tools that simplify data acquisition from density functional theory (DFT) and validation of trained machine learning models using the molSimplify automatic design (mAD) workflow. We demonstrate this workflow by training and comparing the performance of LASSO, kernel ridge regression (KRR), and artificial neural network (ANN) models using heuristic, topological revised autocorrelation (RAC) descriptors we have recently introduced for machine learning inorganic chemistry. On a series of open shell transition metal complexes, we evaluate set aside test errors of these models for predicting the HOMO level and HOMO-LUMO gap. The best performing models are ANNs, which show 0.15 and 0.25 eV test set mean absolute errors on the HOMO level and HOMO-LUMO gap, respectively. Poor performing KRR models using the full 153-feature RAC set are improved to nearly the same performance as the ANNs when trained on down-selected subsets of 20-30 features. Analysis of the essential descriptors for HOMO and HOMO-LUMO gap prediction as well as comparison to subsets previously obtained for other properties reveals the paramount importance of non-local, steric properties in determining frontier molecular orbital energetics. We demonstrate our model performance on diverse complexes and in the discovery of molecules with target HOMO-LUMO gaps from a large 15,000 molecule design space in minutes rather than days that full DFT evaluation would require.
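To illustrate what a topological autocorrelation descriptor looks like in practice, the following is a minimal sketch of a generic product autocorrelation on a molecular graph: for each bond-path depth d, it sums the product of an atomic property over all ordered atom pairs separated by exactly d bonds. This is the textbook graph autocorrelation only, not the RAC implementation in molSimplify; the function names, the toy three-atom fragment, and the choice of nuclear charge as the property are assumptions made for the example.

```python
from collections import deque


def graph_distances(n_atoms, bonds):
    """Return, for each atom, a dict of bond-path distances to reachable atoms (BFS)."""
    neighbors = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dist = []
    for start in range(n_atoms):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in neighbors[cur]:
                if nxt not in seen:
                    seen[nxt] = seen[cur] + 1
                    queue.append(nxt)
        dist.append(seen)
    return dist


def autocorrelation(props, bonds, depth):
    """Sum P_i * P_j over ordered atom pairs (i, j) exactly `depth` bonds apart."""
    dist = graph_distances(len(props), bonds)
    total = 0.0
    for i, p_i in enumerate(props):
        for j, p_j in enumerate(props):
            if dist[i].get(j) == depth:
                total += p_i * p_j
    return total


# Hypothetical toy fragment: a central atom (Z = 8) bonded to two atoms with Z = 1.
toy_charges = [8.0, 1.0, 1.0]
toy_bonds = [(0, 1), (0, 2)]
toy_features = [autocorrelation(toy_charges, toy_bonds, d) for d in range(3)]
```

Each depth gives one feature, so repeating this over several atomic properties and depths yields a fixed-length vector that depends only on connectivity, which is the kind of input a LASSO, KRR, or ANN model could then be trained on.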
The accelerated discovery of materials for real world applications requires the achievement of multiple design objectives. The multidimensional nature of the search necessitates exploration of multimillion compound libraries over which even density functional theory (DFT) screening is intractable. Machine learning (e.g., artificial neural network, ANN, or Gaussian process, GP) models for this task are limited by training data availability and predictive uncertainty quantification (UQ). We overcome such limitations by using efficient global optimization (EGO) with the multidimensional expected improvement (EI) criterion. EGO balances exploitation of a trained model with acquisition of new DFT data at the Pareto front, the region of chemical space that contains the optimal trade-off between multiple design criteria. We demonstrate this approach for the simultaneous optimization of redox potential and solubility in candidate M(II)/M(III) redox couples for redox flow batteries from a space of 2.8 M transition metal complexes designed for stability in practical redox flow battery (RFB) applications. We show that a multitask ANN with latent-distance-based UQ surpasses the generalization performance of a GP in this space. With this approach, ANN prediction and EI scoring of the full space are achieved in minutes. Starting from ca. 100 representative points, EGO improves both properties by over 3 standard deviations in only five generations. Analysis of lookahead errors confirms rapid ANN model improvement during the EGO process, achieving suitable accuracy for predictive design in the space of transition metal complexes. The ANN-driven EI approach achieves at least 500-fold acceleration over random search, identifying a Pareto-optimal design in around 5 weeks instead of 50 years.
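The notion of a Pareto front used above can be made concrete with a short sketch. It filters a list of candidates down to the non-dominated set, assuming both objectives are to be maximized. It does not implement the multidimensional expected improvement criterion or the EGO loop from the abstract; the function names and the unitless two-objective scores are invented for illustration.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates (all objectives maximized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (objective_1, objective_2) scores for five candidates.
candidates = [(1.0, 0.2), (0.8, 0.9), (0.5, 0.5), (0.3, 1.0), (0.8, 0.4)]
front = pareto_front(candidates)
```

The pairwise comparison scales quadratically with the number of candidates, which is fine for a sketch but would likely need a sorted or incremental variant for spaces of millions of compounds.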
Recent transformative advances in computing power and algorithms have made computational chemistry central to the discovery and design of new molecules and materials. First-principles simulations are increasingly accurate and applicable to large systems with the speed needed for high-throughput computational screening. Despite these strides, the combinatorial challenges associated with the vastness of chemical space mean that more than just fast and accurate computational tools are needed for accelerated chemical discovery. In transition-metal chemistry and catalysis, unique challenges arise. The variable spin, oxidation state, and coordination environments favored by elements with well-localized d or f electrons provide great opportunity for tailoring properties in catalytic or functional (e.g., magnetic) materials but also add layers of uncertainty to any design strategy. We outline five key mandates for realizing computationally driven accelerated discovery in inorganic chemistry: (i) fully automated simulation of new compounds, (ii) knowledge of prediction sensitivity or accuracy, (iii) faster-than-fast property prediction methods, (iv) maps for rapid chemical space traversal, and (v) a means to reveal design rules on the kilocompound scale. Through case studies in open-shell transition-metal chemistry, we describe how advances in methodology and software in each of these areas bring about new chemical insights. We conclude with our outlook on the next steps in this process toward realizing fully autonomous discovery in inorganic chemistry using computational chemistry.
A predictive approach for driving down machine learning model errors is introduced and demonstrated across discovery for inorganic and organic chemistry.
High-throughput computational screening for chemical discovery mandates the automated and unsupervised simulation of thousands of new molecules and materials. In challenging materials spaces, such as open shell transition metal chemistry, characterization requires time-consuming first-principles simulation that often necessitates human intervention. These calculations can frequently lead to a null result, e.g., the calculation does not converge or the molecule does not stay intact during a geometry optimization. To overcome this challenge toward realizing fully automated chemical discovery in transition metal chemistry, we have developed the first machine learning models that predict the likelihood of successful simulation outcomes. We train support vector machine and artificial neural network classifiers to predict simulation outcomes (i.e., geometry optimization result and degree of deviation) for a chosen electronic structure method based on chemical composition. For these static models, we achieve an area under the curve of at least 0.95, minimizing computational time spent on non-productive simulations and therefore enabling efficient chemical space exploration. We introduce a metric of model uncertainty based on the distribution of points in the latent space to systematically improve model prediction confidence. In a complementary approach, we train a convolutional neural network classification model on simulation output electronic and geometric structure time series data. This dynamic model generalizes more readily than the static classifier by becoming more predictive as input simulation length increases. Finally, we describe approaches for using these models to enable autonomous job control in transition metal complex discovery.
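One simple way to turn the distribution of points in a latent space into an uncertainty signal is to measure how far a new point sits from its nearest training points. The sketch below computes the mean Euclidean distance to the k nearest training latent vectors and compares it with a cutoff. The abstract does not specify the exact metric, so the averaging scheme, the value of k, the cutoff, and all names here are assumptions, not the authors' definition.

```python
import math


def latent_distance(query, train_latent, k=3):
    """Mean Euclidean distance from `query` to its k nearest training latent vectors."""
    if not train_latent:
        raise ValueError("train_latent must contain at least one vector")
    dists = sorted(math.dist(query, t) for t in train_latent)
    nearest = dists[:k]  # fewer than k vectors are used if the training set is small
    return sum(nearest) / len(nearest)


def is_confident(query, train_latent, cutoff, k=3):
    """Treat a prediction as trustworthy only if the query lies close to training data."""
    return latent_distance(query, train_latent, k) <= cutoff


# Hypothetical 2D latent vectors; a real model would supply its last hidden-layer activations.
train_latent = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
score = latent_distance((0.0, 0.0), train_latent, k=2)
```

Tightening the cutoff trades coverage for confidence: fewer predictions are accepted, but those that remain lie nearer to data the model has seen.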
High-throughput computational screening typically employs methods (i.e., density functional theory or DFT) that can fail to describe challenging molecules, such as those with strongly correlated electronic structure. In such cases, multireference (MR) correlated wavefunction theory (WFT) would be the appropriate choice but remains more challenging to carry out and automate than single-reference (SR) WFT or DFT. Numerous diagnostics have been proposed for identifying when MR character is likely to have an effect on the predictive power of SR calculations, but conflicting conclusions about diagnostic performance have been reached on small data sets. We compute 15 MR diagnostics, ranging from affordable DFT-based to more costly MR-WFT-based diagnostics, on a set of 3,165 equilibrium and distorted small organic molecules containing up to six heavy atoms. Conflicting MR character assignments and low pairwise linear correlations among diagnostics are also observed over this set. We evaluate the ability of existing diagnostics to predict the percent recovery of the correlation energy, %E_corr. None of the DFT-based diagnostics are nearly as predictive of %E_corr as the best WFT-based diagnostics. To overcome the limitation of this cost-accuracy trade-off, we develop machine learning (ML, i.e., kernel ridge regression) models to predict WFT-based diagnostics from a combination of DFT-based diagnostics and a new, size-independent 3D geometric representation. The ML-predicted diagnostics correlate as well with MR effects as their computed (i.e., with WFT) values, significantly improving over the DFT-based diagnostics on which the models were trained. These ML models thus provide a promising approach to improve upon DFT-based diagnostic accuracy while remaining suitably low cost for high-throughput screening.
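The pairwise linear correlations mentioned above can be computed with a few lines of code. The sketch below evaluates the Pearson correlation coefficient for every pair of columns in a small table of diagnostic values. The diagnostic names and numbers are placeholders, not real MR diagnostics or data from the study, and the helper names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant column has zero spread, so this raises ZeroDivisionError for it.
    return cov / (spread_x * spread_y)


def pairwise_correlations(table):
    """Map each unordered pair of column names to the Pearson r of the two columns."""
    return {
        (a, b): pearson(table[a], table[b])
        for a, b in combinations(sorted(table), 2)
    }


# Placeholder values for three made-up diagnostics evaluated on four molecules.
diagnostics = {
    "diag_a": [1.0, 2.0, 3.0, 4.0],
    "diag_b": [2.0, 4.0, 6.0, 8.0],
    "diag_c": [4.0, 3.0, 2.0, 1.0],
}
correlations = pairwise_correlations(diagnostics)
```

A coefficient near zero for a pair of diagnostics would indicate that they rank molecules very differently, which is the kind of disagreement the abstract reports.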
Metal−oxo moieties are important catalytic intermediates in the selective partial oxidation of hydrocarbons and in water splitting. Stable metal−oxo species have reactive properties that vary depending on the spin state of the metal, complicating the development of structure−property relationships. To overcome these challenges, we train machine-learning (ML) models capable of predicting metal−oxo formation energies across a range of first-row metals, oxidation states, and spin states. Using connectivity-only features tailored for inorganic chemistry as inputs to kernel ridge regression or artificial neural network (ANN) ML models, we achieve good mean absolute errors (4−5 kcal/mol) on set-aside test data across a range of ligand orientations. Analysis of feature importance for oxo formation energy prediction reveals the dominance of nonlocal, electronic ligand properties in contrast to other transition metal complex properties (e.g., spin-state or ionization potential). We enumerate the theoretical catalyst space with an ANN, revealing expected trends in oxo formation energetics, such as destabilization of the metal−oxo species with increasing d-filling, as well as exceptions, such as weak correlations with indicators of oxidative stability of the metal in the resting state or unexpected spin-state dependence in reactivity. We carry out uncertainty-aware evolutionary optimization using the ANN to explore a >37 000 candidate catalyst space. New metal and oxidation state combinations are uncovered and validated with density functional theory (DFT), including counterintuitive oxo formation energies for oxidatively stable complexes. This approach doubles the density of confirmed DFT leads in originally sparsely populated regions of property space, highlighting the potential of ML-model-driven discovery to uncover catalyst design rules and exceptions.
Transition-metal complexes are attractive targets for the design of catalysts and functional materials. The behavior of the metal−organic bond, while very tunable for achieving target properties, is challenging to predict and necessitates searching a wide and complex space to identify needles in haystacks for target applications. This review will focus on the techniques that make high-throughput search of transition-metal chemical space feasible for the discovery of complexes with desirable properties. The review will cover the development, promise, and limitations of "traditional" computational chemistry (i.e., force field, semiempirical, and density functional theory methods) as it pertains to data generation for inorganic molecular discovery. The review will also discuss the opportunities and limitations in leveraging experimental data sources. We will focus on how advances in statistical modeling, artificial intelligence, multiobjective optimization, and automation accelerate discovery of lead compounds and design rules. The overall objective of this review is to showcase how bringing together advances from diverse areas of computational chemistry and computer science has enabled the rapid uncovering of structure−property relationships in transition-metal chemistry. We aim to highlight how unique considerations in motifs of metal−organic bonding (e.g., variable spin and oxidation state, and bonding strength/nature) set them and their discovery apart from more commonly considered organic molecules. We will also highlight how uncertainty and relative data scarcity in transition-metal chemistry motivate specific developments in machine learning representations, model training, and in computational chemistry. Finally, we will conclude with an outlook on areas of opportunity for the accelerated discovery of transition-metal complexes.