For a number of years MDL products have exposed both 166 bit and 960 bit keysets based on 2D descriptors. These keysets were originally constructed and optimized for substructure searching. We report on improvements in the performance of MDL keysets which are reoptimized for use in molecular similarity. Classification performance for a test data set of 957 compounds was increased from 0.65 for the 166 bit keyset and 0.67 for the 960 bit keyset to 0.71 for a surprisal S/N pruned keyset containing 208 bits and 0.71 for a genetic algorithm optimized keyset containing 548 bits. We present an overview of the underlying technology supporting the definition of descriptors and the encoding of these descriptors into keysets. This technology allows definition of descriptors as combinations of atom properties, bond properties, and atomic neighborhoods at various topological separations as well as supporting a number of custom descriptors. These descriptors can then be used to set one or more bits in a keyset. We constructed various keysets and optimized their performance in clustering bioactive substances. Performance was measured using methodology developed by Briem and Lessel. "Directed pruning" was carried out by eliminating bits from the keysets on the basis of random selection, values of the surprisal of the bit, or values of the surprisal S/N ratio of the bit. The random pruning experiment highlighted the insensitivity of keyset performance for keyset lengths of more than 1000 bits. Contrary to initial expectations, pruning on the basis of the surprisal values of the various bits resulted in keysets which underperformed those resulting from random pruning. In contrast, pruning on the basis of the surprisal S/N ratio was found to yield keysets which performed better than those resulting from random pruning. We also explored the use of genetic algorithms in the selection of optimal keysets. 
Once more, performance was only a weak function of keyset size, and the optimizations failed to identify a single globally optimal keyset. Instead, multiple equally optimal keysets could be produced, with relatively low overlap among the descriptors they encoded.
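The pruning experiments described above can be illustrated with a minimal sketch. It assumes each keyset is stored as a Python set of on-bit indices and takes the surprisal of a bit to be the standard information-theoretic quantity -log2(p), where p is the fraction of compounds that set the bit. The abstract does not define the surprisal S/N ratio or say whether high-scoring or low-scoring bits were kept, so the function names (`bit_surprisal`, `random_prune`, `prune_by_score`) and the `keep_highest` flag are hypothetical, and any per-bit score can be supplied in place of the surprisal.

```python
import math
import random


def bit_surprisal(keysets, n_bits):
    """Return -log2(p) for each bit, where p is the fraction of keysets that set it.

    Bits that are never set get an infinite surprisal.
    """
    n = len(keysets)
    scores = []
    for bit in range(n_bits):
        count = sum(1 for keys in keysets if bit in keys)
        p = count / n
        scores.append(-math.log2(p) if p > 0 else math.inf)
    return scores


def random_prune(n_bits, n_keep, seed=0):
    """Baseline: choose n_keep bit positions uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_bits), n_keep))


def prune_by_score(scores, n_keep, keep_highest=True):
    """Keep the n_keep bits with the highest (or lowest) per-bit score.

    The direction is an assumption; the abstract does not state it.
    """
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=keep_highest)
    return sorted(ranked[:n_keep])


def apply_pruning(keysets, kept_bits):
    """Restrict every keyset to the retained bit positions."""
    kept = set(kept_bits)
    return [keys & kept for keys in keysets]


# Made-up toy keysets over 4 bits, for illustration only.
toy_keysets = [{0, 1}, {0, 2}, {0}, {0, 1}]
toy_scores = bit_surprisal(toy_keysets, 4)
kept_low = prune_by_score(toy_scores, 2, keep_highest=False)
pruned = apply_pruning(toy_keysets, kept_low)
```

Comparing a directed selection against `random_prune` at the same keyset length mirrors the baseline comparison in the abstract; the classification measure of Briem and Lessel is not reproduced here.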
Three-dimensional structure databases, and their accompanying searching software, have been available for several years, both from commercial software vendors and by in-house development. Commercially available systems include MACCS-II/3D and ISIS/3D (MDL Information Systems, Inc.), Aladdin (Abbott Laboratories and Daylight Chemical Information Systems, Inc.), ChemDBS-3D (Chemical Design Ltd.), SYBYL/3DB Unity (Tripos Associates), and Catalyst (BioCad Corp.). With the exception of ChemDBS-3D and, most recently, SYBYL/3DB Unity, these software products apply geometric searching algorithms to static 3D models. This is acceptable for relatively rigid structures but fails to take into account the inherent flexibility of many molecules of biological interest. Approaches to this problem have included the registration of multiple conformations (all products), conformational analysis at registration and search time (ChemDBS-3D), development of 3D queries that can accommodate limited flexibility in the target structures (MACCS-II/3D, ISIS/3D), and, most recently, application of the directed tweak approach (SYBYL/3DB Unity). This paper will discuss a new approach to the problem, implemented within the ISIS/3D software. The method uses a multilevel screening and constraint-fitting approach, applying torsional optimization with van der Waals energy contributions in the later stages. Methodology and examples are covered.
Using a small database of defined substrates in humans for cytochrome P450 mixed function oxidases, a series of descriptors and classification methods were evaluated with respect to how well they correctly classified substrates. The descriptors ranged from structural keys to topological to electronic. A variety of classification schemes were examined in terms of their ability to point out which descriptors are important for predicting the cytochrome P450 specificity for a substrate. Results illustrate the relative effectiveness of the various kinds of descriptors and classification methods, as well as the value of using as well-defined a data set as possible.
A pattern-recognition analysis using the ADAPT system was performed on a set of 9-anilinoacridine antitumor agents, to determine whether computer-generated descriptors could be used to separate active from inactive compounds. A training set of 213 compounds was chosen by random computer selection from a list of 776 structures. Maximal increase in life span at the LD10 dosage, a response which is difficult to model using traditional Hansch analysis, was used as the measure of biological activity. A set of 18 molecular descriptors, including fragment, substructure environment, and physicochemical property descriptors (molar refraction, partial electronic charge) was identified which could correctly classify 94% of the compounds in the training set (97% of active and 85% of inactive compounds). Eight of the inactive compounds that were misclassified contained amino substituents, suggesting a role for ionization. The weight vector that was obtained from the training set was applied to a prediction set of 50 compounds that were not included in the original analysis and to a set of 69 structures drawn from the recent literature. The prediction set results, ranging from 73 to 86% correct, were lower than those of the training set, but they clearly indicate that pattern-recognition techniques can be useful in the screening of proposed or already existing agents and especially useful for the identification of active compounds.
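The idea of obtaining a weight vector from a training set and then applying it to a separate prediction set can be sketched as follows. This is a minimal illustration, not the ADAPT implementation: it assumes a perceptron-style error-correction update, which the abstract does not specify, and the descriptor values, labels (+1 active, -1 inactive), and function names are made up for the example.

```python
def train_weight_vector(X, y, epochs=100, lr=1.0):
    """Fit a linear discriminant by error-correction feedback (perceptron rule).

    X is a list of descriptor vectors, y a list of +1/-1 class labels.
    The returned list holds one weight per descriptor plus a trailing bias term.
    """
    n = len(X[0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        errors = 0
        for x, label in zip(X, y):
            score = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Move the decision surface toward the misclassified point.
                for i in range(n):
                    w[i] += lr * label * x[i]
                w[-1] += lr * label
                errors += 1
        if errors == 0:
            break  # every training compound is classified correctly
    return w


def classify(w, x):
    """Apply a trained weight vector to one descriptor vector."""
    score = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > 0 else -1


def percent_correct(w, X, y):
    """Percentage of compounds whose predicted class matches the label."""
    hits = sum(1 for x, label in zip(X, y) if classify(w, x) == label)
    return 100.0 * hits / len(X)


# Made-up, linearly separable toy descriptors (not data from the study).
train_X = [[2, 1], [3, 2], [-1, -2], [-2, -1]]
train_y = [1, 1, -1, -1]
weights = train_weight_vector(train_X, train_y)

# A held-out "prediction set" scored with the same weight vector.
test_X = [[4, 4], [-3, -3]]
test_y = [1, -1]
test_accuracy = percent_correct(weights, test_X, test_y)
```

Scoring held-out compounds with the frozen weights, rather than refitting, is what allows the prediction-set percentage to be compared with the training-set percentage.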