When analyzing chemical reactions, it is essential to know which molecules are actively involved in the reaction and which educts form the product molecules. Assigning reaction roles, such as reactant, reagent, or product, to the molecules of a chemical reaction may be trivial for hand-curated reaction schemes, but it is more difficult to automate, and automation is essential when handling large amounts of reaction data. Here, we describe a new fingerprint-based, data-driven approach to assigning reaction roles that is also applicable to rather unbalanced and noisy reaction schemes. Given the set of molecules involved and the known product(s) of a reaction, we assign the most probable reactants and classify the remaining molecules as reagents. Our approach was validated using two data sets. The first is a hand-curated set of about 680 diverse reactions extracted from patents, which span more than 200 different reaction types and include up to 18 different reactants. The second consists of 50,000 randomly picked reactions from US patents. The results for the second data set were compared to results obtained using two different atom-to-atom mapping algorithms. For both data sets, our method assigns the reaction roles correctly for the vast majority of the reactions, achieving accuracies of 88% and 97%, respectively. The median time needed, about 8 ms, indicates that the algorithm is fast enough to be applied to large collections. The new method is available as part of the RDKit toolkit, and the data sets and Jupyter notebooks used for its evaluation are available in the Supporting Information of this publication.
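
The idea of picking the candidates whose combined fingerprint best explains the product can be illustrated with a small, self-contained sketch. This is not the RDKit implementation described above: it assumes heavy-atom element counts as a stand-in for real reaction fingerprints, and the greedy search, the function names (`l1_distance`, `assign_roles`), and the example reaction are all hypothetical choices made for illustration.

```python
from collections import Counter


def l1_distance(a, b):
    """Sum of absolute count differences between two count fingerprints."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)


def assign_roles(candidates, product_fp):
    """Greedily pick candidates whose summed fingerprint approaches the product.

    Candidates that never improve the match are returned as reagents.
    (Toy heuristic for illustration, not the published algorithm.)
    """
    selected = []
    combined = Counter()
    remaining = dict(candidates)
    best = l1_distance(combined, product_fp)
    while remaining:
        # Find the candidate that brings the combined fingerprint closest.
        name, score = min(
            ((n, l1_distance(combined + fp, product_fp))
             for n, fp in remaining.items()),
            key=lambda item: item[1],
        )
        if score >= best:
            break  # no remaining candidate improves the match
        selected.append(name)
        combined = combined + remaining.pop(name)
        best = score
    return selected, sorted(remaining)


# Hypothetical esterification with a catalyst and a solvent (heavy atoms only).
candidates = {
    "acetic acid": Counter({"C": 2, "O": 2}),
    "ethanol": Counter({"C": 2, "O": 1}),
    "sulfuric acid": Counter({"S": 1, "O": 4}),
    "toluene": Counter({"C": 7}),
}
product_fp = Counter({"C": 4, "O": 2})  # ethyl acetate

reactants, reagents = assign_roles(candidates, product_fp)
print(reactants)  # ['acetic acid', 'ethanol']
print(reagents)   # ['sulfuric acid', 'toluene']
```

Element counts cannot distinguish isomers or account for leaving groups, which is one reason a real implementation would rely on richer, data-driven fingerprints.
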
An extended reduced graph approach (ErG) is presented that uses pharmacophore-type node descriptions to encode the relevant molecular properties. The basic idea of the method can be described as a hybrid of reduced graphs (Gillet et al. J. Chem. Inf. Comput. Sci. 2003, 43, 338-345) and binding property pairs (Kearsley et al. J. Chem. Inf. Comput. Sci. 1996, 36, 118-127). However, specific extensions that correctly describe the pharmacophoric properties, size, and shape of the molecules under study result in very stable and good performance compared to DAYLIGHT fingerprints (DFP). This is exemplified for 11 activity classes of the MDL Drug Data Report database, for which ErG performs as well as or better than DFP in 10 cases. On the basis of the example data sets, the ability of ErG to switch from one chemotype to another (often referred to as "scaffold hopping") is highlighted. Additionally, possible pitfalls of reduced graph approaches, as well as suitable solutions, are discussed with the help of example structures. Overall, it is shown that ErG is a widely applicable method capable of identifying structurally diverse actives for a given active search query. This diversity is achieved by a high degree of molecular abstraction, which in turn results in a low-dimensional descriptor vector that allows very low computation times for similarity searches.
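
The combination of a reduced graph with property pairs can be sketched in a few lines of plain Python. The example below is a simplified illustration only, not the ErG algorithm itself: it assumes a hand-built reduced graph in which each node carries at most one pharmacophore type (untyped nodes act as linkers), counts typed node pairs by topological distance, and compares two descriptors with a count-based Tanimoto coefficient. The function names and both example graphs are hypothetical.

```python
from collections import Counter, deque


def bfs_distances(adjacency, start):
    """Shortest path length (in edges) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def property_pair_descriptor(node_types, edges, max_dist=15):
    """Count (type A, type B, distance) triples over all typed node pairs."""
    adjacency = {node: [] for node in node_types}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    typed = sorted(n for n, t in node_types.items() if t is not None)
    counts = Counter()
    for idx, a in enumerate(typed):
        dist = bfs_distances(adjacency, a)
        for b in typed[idx + 1:]:
            d = dist.get(b)
            if d is not None and d <= max_dist:
                pair = tuple(sorted((node_types[a], node_types[b])))
                counts[pair + (d,)] += 1
    return counts


def tanimoto(a, b):
    """Count-based Tanimoto similarity between two descriptors."""
    keys = set(a) | set(b)
    shared = sum(min(a[k], b[k]) for k in keys)
    total = sum(max(a[k], b[k]) for k in keys)
    return shared / total if total else 0.0


# Hypothetical reduced graphs: Donor - linker - Aromatic - Acceptor,
# and a variant with an extra linker before the acceptor.
query = property_pair_descriptor(
    {0: "Donor", 1: None, 2: "Aromatic", 3: "Acceptor"},
    [(0, 1), (1, 2), (2, 3)],
)
other = property_pair_descriptor(
    {0: "Donor", 1: None, 2: "Aromatic", 3: None, 4: "Acceptor"},
    [(0, 1), (1, 2), (2, 3), (3, 4)],
)
similarity = tanimoto(query, other)
```

Because the descriptor depends only on pharmacophore types and distances, two molecules with different scaffolds but the same reduced graph would receive identical descriptors in this sketch, which is the abstraction that makes scaffold hopping possible.
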
A general-purpose force field such as MMFF94/MMFF94s, which can properly deal with a
wide range of diverse structures, is very valuable in the context of a
cheminformatics toolkit. Herein we present an open-source implementation of this
force field within the RDKit. The new MMFF functionality can be accessed through a
C++/C#/Python/Java application programming interface (API) developed along the lines
of the one already available for UFF in the RDKit. Our implementation was fully
validated against the official validation suite provided by the MMFF authors. All
energies and gradients were correctly computed; moreover, atom types and force
constants were correctly assigned for 3D molecules built from SMILES strings. To
provide full flexibility, the API allows individual terms of the MMFF energy
expression to be included or excluded and supports constrained geometry
optimizations. The availability of an MMFF-capable molecular mechanics engine
coupled with the rest of the RDKit functionality and covered by the BSD license is
appealing to researchers operating in both academia and industry.
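
What it means to switch energy terms on or off and to optimize under a constraint can be shown with a toy model. The sketch below is deliberately not MMFF94 and does not use the RDKit API: it assumes three atoms on a line, a harmonic bond term, a purely repulsive nonbonded term, and a harmonic distance restraint, minimized by plain gradient descent with numerical gradients. All parameter values and function names are hypothetical.

```python
K_BOND, R0 = 100.0, 1.5   # toy harmonic bond parameters (not MMFF values)
EPS, SIGMA = 0.1, 2.0     # toy repulsion parameters (not MMFF values)


def energy(x, use_bond=True, use_vdw=True, constraint=None):
    """Energy of three atoms on a line; each term can be switched off."""
    total = 0.0
    if use_bond:
        for i, j in ((0, 1), (1, 2)):
            total += 0.5 * K_BOND * (abs(x[j] - x[i]) - R0) ** 2
    if use_vdw:
        total += EPS * (SIGMA / abs(x[2] - x[0])) ** 12
    if constraint is not None:
        i, j, target, k = constraint  # harmonic restraint on one distance
        total += 0.5 * k * (abs(x[j] - x[i]) - target) ** 2
    return total


def minimize(x, steps=2000, rate=2e-4, h=1e-6, **terms):
    """Gradient descent using central-difference numerical gradients."""
    x = list(x)
    for _ in range(steps):
        grad = []
        for a in range(len(x)):
            plus = x[:a] + [x[a] + h] + x[a + 1:]
            minus = x[:a] + [x[a] - h] + x[a + 1:]
            grad.append(
                (energy(plus, **terms) - energy(minus, **terms)) / (2 * h)
            )
        x = [xi - rate * g for xi, g in zip(x, grad)]
    return x


start = [0.0, 1.2, 2.9]

# Unconstrained optimization with all terms enabled.
free = minimize(start)
free_span = free[2] - free[0]

# Same system with the 0-2 distance restrained to 2.5.
restrained = minimize(start, constraint=(0, 2, 2.5, 1000.0))
restrained_span = restrained[2] - restrained[0]

# Energy of the relaxed geometry with the nonbonded term excluded.
bond_only_energy = energy(free, use_vdw=False)
```

In this toy setting the restraint competes with the two bond terms, so the restrained end-to-end distance settles slightly above the 2.5 target rather than exactly on it; a stiffer restraint narrows that gap.
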