2021
DOI: 10.1002/minf.202100119

Reaction Data Curation I: Chemical Structures and Transformations Standardization

Abstract: The quality of experimental data for chemical reactions is a critical consideration for any reaction-driven study. However, the curation of reaction data has not been extensively discussed in the literature so far. Here, we suggest a four-step protocol that includes the curation of individual structures (reactants and products), chemical transformations, reaction conditions and endpoints. Its implementation in Python3 using the CGRTools toolkit has been used to clean three popular reaction databases Reaxys, USPTO an…

Cited by 26 publications (38 citation statements)
References 63 publications (102 reference statements)
“…CGRs can be obtained for both balanced and imbalanced reactions, and imbalanced reactions can be balanced via decomposition of the CGR. 44 However, correct labels for missing atoms and bonds can only be recovered for some but not all reactions using CGR decomposition, namely, if no rearrangement occurs within the missing fragments. An automatic balancing via the CGR therefore potentially introduces noise to a data set, if some of the missing fragments are wrongly autocompleted.…”
Section: Methods (mentioning)
confidence: 99%
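To make "imbalanced" concrete, the following is a minimal sketch of an element-count check between the two sides of a reaction. It is not CGR decomposition and does not use CGRtools; the function names and the dict-based formula representation are assumptions made for illustration only. It reports which elements are unaccounted for, but not how the missing atoms are bonded, which is the part the citing statement says automatic balancing can get wrong.

```python
from collections import Counter


def element_counts(molecules):
    """Sum element counts over molecules given as {element: count} dicts."""
    total = Counter()
    for formula in molecules:
        total.update(formula)  # Counter.update adds counts from a mapping
    return total


def missing_atoms(reactants, products):
    """Return the atoms that each side lacks relative to the other.

    Counter subtraction keeps only positive counts, so each entry lists
    the surplus of one side, i.e. what is missing from the opposite side.
    """
    left = element_counts(reactants)
    right = element_counts(products)
    return {
        "products": dict(left - right),   # present in reactants only
        "reactants": dict(right - left),  # present in products only
    }


# Esterification recorded without its water by-product (illustrative data):
# acetic acid + ethanol -> ethyl acetate
reactants = [{"C": 2, "H": 4, "O": 2}, {"C": 2, "H": 6, "O": 1}]
products = [{"C": 4, "H": 8, "O": 2}]
print(missing_atoms(reactants, products))
```

An empty dict on both sides means the record is balanced at the level of element counts; a non-empty one flags a record that would need completion before atom-level labels can be trusted.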
“…The model is trained on the combined open-source reaction dataset USPTO 29 and commercial reaction dataset Pistachio 19 . The data normalization followed the process described in 30 and duplicated entries were removed.…”
Section: Data Sets (mentioning)
confidence: 99%
“…The initial dataset of one-step hydrogenation reactions containing 591,563 reactions (391,880 chemical transformations) was extracted from the Reaxys® database in May 2019. We follow the same terminology as in our earlier publication [13]: by “transformation” we mean a set of reactants and products, “reaction” is a transformation carried out in the given conditions. Hydrogenation reactions were revealed by the presence of “H2” or “hydrogen” keyword in the reagent list for at least one condition corresponding to a reaction.…”
Section: Computational Procedures (mentioning)
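The keyword filter and the transformation/reaction distinction quoted above can be sketched as follows. This is an illustrative sketch, not the citing authors' code: the `ReactionRecord` class, its fields, and the sample records are hypothetical, and matching a whole reagent name exactly (case-insensitively) is an assumption, since the statement does not say whether partial matches such as "hydrogen peroxide" were counted.

```python
from dataclasses import dataclass

KEYWORDS = {"h2", "hydrogen"}  # keywords named in the quoted statement


@dataclass
class ReactionRecord:
    """Hypothetical record: one transformation plus its reported conditions."""
    reactants: tuple   # reactant structures, e.g. SMILES strings
    products: tuple    # product structures
    conditions: list   # one reagent-name list per reported condition


def is_hydrogenation(record):
    """True if any condition lists a reagent named 'H2' or 'hydrogen'.

    Assumption: the whole reagent name must match, ignoring case.
    """
    return any(
        reagent.strip().lower() in KEYWORDS
        for reagents in record.conditions
        for reagent in reagents
    )


def count_transformations(records):
    """Count distinct (reactants, products) sets, ignoring conditions."""
    return len({(frozenset(r.reactants), frozenset(r.products)) for r in records})


records = [
    ReactionRecord(("C=C",), ("CC",), [["H2", "Pd/C"]]),
    ReactionRecord(("C=C",), ("CC",), [["NaBH4"], ["PtO2", "Hydrogen"]]),
    ReactionRecord(("CSC",), ("CS(C)=O",), [["hydrogen peroxide"]]),
]
hydrogenations = [r for r in records if is_hydrogenation(r)]
print(len(hydrogenations), count_transformations(hydrogenations))
```

In this toy data, two records pass the filter but share a single transformation, which mirrors why the quoted dataset reports more reactions than transformations.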
confidence: 99%
“…Hydrogenation reactions were revealed by the presence of “H2” or “hydrogen” keyword in the reagent list for at least one condition corresponding to a reaction. Chemical structures were standardized according to the protocol described by Gimadiev et al. [13]. CGRtools [14] was used for functional group normalization, aromatization, removing explicit hydrogens and duplicate cleaning.…”
Section: Computational Procedures (mentioning)
confidence: 99%