2023
DOI: 10.1039/d3sc03902a

The challenge of balancing model sensitivity and robustness in predicting yields: a benchmarking study of amide coupling reactions

Zhen Liu,
Yurii S. Moroz,
Olexandr Isayev

Abstract: A sensitive model captures the reactivity cliffs but overfits to yield outliers. On the other hand, a robust model disregards the yield outliers but underfits the reactivity cliffs.

Cited by 4 publications (4 citation statements)
References 41 publications

“…Notably, experimental results from these studies indicate that, while DFT descriptors outperform the SMILES-based pretraining fingerprint Rxnfp and the rule-based fingerprint DRFP in literature-extracted data, the performance rankings are reversed when applied to HTE data. The same trend is also reported in another benchmark study. We posit that knowledge-based features, such as DFT-calculated terms, exhibit greater robustness to noisy data, as seen in the literature-extracted reactions.…”
Section: Main (supporting)
confidence: 61%
“…44 The same trend is also reported in another benchmark study. 147 We posit that knowledge-based features, such as DFT-calculated terms, exhibit greater robustness to noisy data, as seen in the literature-extracted reactions. The noisy data set demonstrates a substantial bias where low-yield reactions are consistently absent.…”
Section: Graph (mentioning)
confidence: 99%
“…Instead, assay yields are often reported, such as UV area percents, percent conversions, or product/internal standard ratios. This may pose questions regarding the relevance of using HTE readouts to predict synthetic yields at larger scales due to the confounding factors that isolation may introduce, and as a result, the prediction from a model trained exclusively on HTE data may not necessarily translate into material delivery to assays. Public data sets of varying sizes, sourced from the USPTO, scientific literature, and beyond have also been used to build models for yield prediction and other tasks such as condition recommendation. These data sets, while large, exhibit significant procedural variation among different data sources, causing yield prediction models to exhibit low performance.…”
Section: Introduction (mentioning)
confidence: 99%
“…Previous ML models for reaction outcome prediction are suitable for either the nonlearned (e.g., molecular descriptors, fingerprints) or learned (e.g., SMILES and graphs) representations of molecular encodings [27][28][29]. In this work, we present an ML model that works well with both kinds of molecular inputs.…”
mentioning
confidence: 99%