The challenge of balancing model sensitivity and robustness in predicting yields: a benchmarking study of amide coupling reactions

Liu, Zhen; Moroz, Yurii S.; Isayev, Olexandr

doi:10.1039/d3sc03902a

Cited by 4 publications

(4 citation statements)

References 41 publications


“…Notably, experimental results from these studies indicate that, while DFT descriptors outperform the SMILES-based pretraining fingerprint Rxnfp and the rule-based fingerprint DRFP in literature-extracted data, the performance rankings are reversed when applied to HTE data. The same trend is also reported in another benchmark study. We posit that knowledge-based features, such as DFT-calculated terms, exhibit greater robustness to noisy data, as seen in the literature-extracted reactions.…”

Section: Main

supporting

confidence: 61%


Exploring Chemical Reaction Space with Machine Learning Models: Representation and Feature Perspective

Ding, Qiang, Chen et al. 2024

J. Chem. Inf. Model.


Chemical reactions serve as foundational building blocks for organic chemistry and drug design. In the era of large AI models, data-driven approaches have emerged to innovate the design of novel reactions, optimize existing ones for higher yields, and discover new pathways for synthesizing chemical structures comprehensively. To effectively address these challenges with machine learning models, it is imperative to derive robust and informative representations or engage in feature engineering using extensive data sets of reactions. This work aims to provide a comprehensive review of established reaction featurization approaches, offering insights into the selection of representations and the design of features for a wide array of tasks. The advantages and limitations of employing SMILES, molecular fingerprints, molecular graphs, and physics-based properties are meticulously elaborated. Solutions to bridge the gap between different representations will also be critically evaluated. Additionally, we introduce a new frontier in chemical reaction pretraining, holding promise as an innovative yet unexplored avenue.


“…[44] The same trend is also reported in another benchmark study [147]. We posit that knowledge-based features, such as DFT-calculated terms, exhibit greater robustness to noisy data, as seen in the literature-extracted reactions. The noisy data set demonstrates a substantial bias where low-yield reactions are consistently absent.…”

Section: Graph

mentioning

confidence: 99%

Exploring Chemical Reaction Space with Machine Learning Models: Representation and Feature Perspective

Ding, Qiang, Chen et al. 2024

J. Chem. Inf. Model.


“…Instead, assay yields are often reported, such as UV area percents, percent conversions, or product/internal standard ratios. This may pose questions regarding the relevance of using HTE readouts to predict synthetic yields at larger scales due to the confounding factors that isolation may introduce, and as a result, the prediction from a model trained exclusively on HTE data may not necessarily translate into material delivery to assays. Public data sets of varying sizes, sourced from the USPTO, scientific literature, and beyond have also been used to build models for yield prediction and other tasks such as condition recommendation. These data sets, while large, exhibit significant procedural variation among different data sources, causing yield prediction models to exhibit low performance.…”

Section: Introduction

mentioning

confidence: 99%

Incorporating Synthetic Accessibility in Drug Design: Predicting Reaction Yields of Suzuki Cross-Couplings by Leveraging AbbVie’s 15-Year Parallel Library Data Set

Raghavan, Rago, Verma et al. 2024

J. Am. Chem. Soc.


Despite the increased use of computational tools to supplement medicinal chemists’ expertise and intuition in drug design, predicting synthetic yields in medicinal chemistry endeavors remains an unsolved challenge. Existing design workflows could profoundly benefit from reaction yield prediction, as precious material waste could be reduced, and a greater number of relevant compounds could be delivered to advance the design, make, test, analyze (DMTA) cycle. In this work, we detail the evaluation of AbbVie’s medicinal chemistry library data set to build machine learning models for the prediction of Suzuki coupling reaction yields. The combination of density functional theory (DFT)-derived features and Morgan fingerprints was identified to perform better than one-hot encoded baseline modeling, furnishing encouraging results. Overall, we observe modest generalization to unseen reactant structures within the 15-year retrospective library data set. Additionally, we compare predictions made by the model to those made by expert medicinal chemists, finding that the model can often predict both reaction success and reaction yields with greater accuracy. Finally, we demonstrate the application of this approach to suggest structurally and electronically similar building blocks to replace those predicted or observed to be unsuccessful prior to or after synthesis, respectively. The yield prediction model was used to select similar monomers predicted to have higher yields, resulting in greater synthesis efficiency of relevant drug-like molecules.


“…Previous ML models for reaction outcome prediction are suitable for either the nonlearned (e.g., molecular descriptors, fingerprints) or learned (e.g., SMILES and graphs) representations of molecular encodings [27][28][29]. In this work, we present an ML model that works well with both kinds of molecular inputs.…”

mentioning

confidence: 99%

Deep Kernel learning for reaction outcome prediction and optimization

Singh, Hernández-Lobato 2024

Commun Chem


Recent years have seen a rapid growth in the application of various machine learning methods for reaction outcome prediction. Deep learning models have gained popularity due to their ability to learn representations directly from the molecular structure. Gaussian processes (GPs), on the other hand, provide reliable uncertainty estimates but are unable to learn representations from the data. We combine the feature learning ability of neural networks (NNs) with uncertainty quantification of GPs in a deep kernel learning (DKL) framework to predict the reaction outcome. The DKL model is observed to obtain very good predictive performance across different input representations. It significantly outperforms standard GPs and provides comparable performance to graph neural networks, but with uncertainty estimation. Additionally, the uncertainty estimates on predictions provided by the DKL model facilitated its incorporation as a surrogate model for Bayesian optimization (BO). The proposed method, therefore, has a great potential towards accelerating reaction discovery by integrating accurate predictive models that provide reliable uncertainty estimates with BO.


The challenge of balancing model sensitivity and robustness in predicting yields: a benchmarking study of amide coupling reactions

Abstract: A sensitive model captures the reactivity cliffs but overfits to yield outliers. On the other hand, a robust model disregards the yield outliers but underfits the reactivity cliffs.

