We have developed an algorithm that clusters structural databases using topological similarity. The first step in this procedure is to identify a set of probe structures that all fall outside a defined similarity score cutoff with respect to one another. This list of probes is then used to bin the remaining compounds in the database. In the last step, some housekeeping is performed to ensure that each compound in the dataset is either a probe or is contained in one and only one bin. We have applied this clustering method to a database of ∼27 000 compounds for which we have screening-level biological data. Analysis of the resulting clusters shows that clusters defined by an active probe are much more likely to contain other active compounds than clusters defined by an inactive probe. Indeed, the incidence of active compounds in bins with active probes is 6 to 10 times greater than the incidence of active compounds in the database as a whole. This result demonstrates the power of simple two-dimensional topological descriptors and serves to validate our clustering algorithm.
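The three steps described above (probe selection, binning, housekeeping) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how topological similarity is scored, so the sketch assumes Tanimoto similarity over sets of integer fingerprint bits, visits compounds in a seeded random order, and resolves the "one and only one bin" requirement by assigning each compound to its most similar probe. The names `tanimoto`, `select_probes`, and `bin_compounds` are hypothetical.

```python
import random
from typing import Dict, FrozenSet, List, Sequence

Fingerprint = FrozenSet[int]  # assumed representation: set of "on" bit indices


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity of two bit sets (assumed similarity score)."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def select_probes(fps: Sequence[Fingerprint], cutoff: float, seed: int = 0) -> List[int]:
    """Pick probes so that every pair of probes scores below the cutoff."""
    order = list(range(len(fps)))
    random.Random(seed).shuffle(order)  # assumption: random visiting order
    probes: List[int] = []
    for i in order:
        # A compound becomes a probe only if it is dissimilar to all probes so far.
        if all(tanimoto(fps[i], fps[p]) < cutoff for p in probes):
            probes.append(i)
    return probes


def bin_compounds(fps: Sequence[Fingerprint], probes: Sequence[int]) -> Dict[int, List[int]]:
    """Assign each non-probe compound to exactly one bin.

    Assumed housekeeping rule: a compound that is similar to several probes
    goes to the probe it is most similar to.
    """
    bins: Dict[int, List[int]] = {p: [] for p in probes}
    probe_set = set(probes)
    for i in range(len(fps)):
        if i in probe_set:
            continue
        best = max(probes, key=lambda p: tanimoto(fps[i], fps[p]))
        bins[best].append(i)
    return bins
```

Because a compound is rejected as a probe only when it meets the cutoff against some existing probe, every non-probe compound has at least one probe to be binned with, so each compound ends up either as a probe or in exactly one bin.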
It is often impractical to synthesize and test all compounds in a large exhaustive chemical library. Herein, we discuss rational approaches to selecting representative subsets of virtual libraries that help direct experimental synthetic efforts for diverse library design. We compare the performance of two stochastic sampling algorithms, Simulated Annealing Guided Evaluation (SAGE; Zheng, W.; Cho, S. J.; Waller, C. L.; Tropsha, A. J. Chem. Inf. Comput. Sci. 1999, 39, 738-746) and Stochastic Cluster Analysis (SCA; Reynolds, C. H.; Druker, R.; Pfahler, L. B. Lead Discovery Using Stochastic Cluster Analysis (SCA): A New Method for Clustering Structurally Similar Compounds. J. Chem. Inf. Comput. Sci. 1998, 38, 305-312), for their ability to select subsets that are both diverse and representative of the entire chemical library space. The SAGE and SCA algorithms were compared using u- and s-optimal metrics as an independent assessment of diversity and coverage. This comparison showed that both algorithms were capable of generating sublibraries in descriptor space that are diverse and give reasonable coverage (i.e., are representative) of the original full library. Tests were carried out using simulated two-dimensional data sets and a 27 000-compound proprietary structural library as represented by computed Molconn-Z descriptors. One of the key observations from this work is that the algorithmically simple SCA method is capable of selecting subsets that are comparable to those selected by the more computationally intensive SAGE method.
Scaling is a difficult issue for any analysis of chemical properties or molecular topology when disparate descriptors are involved. To compare properties across different data sets, a common scale must be defined. Using several publicly available databases (ACD, CMC, MDDR, and NCI) as a basis, we propose chemically meaningful scales for a number of molecular properties and topology descriptors. These chemically derived scaling functions have several advantages. First, it is possible to define chemically relevant scales, greatly simplifying similarity and diversity analyses across data sets. Second, this approach provides a convenient method for setting descriptor boundaries that define chemically reasonable topology spaces. For example, descriptors can be scaled so that compounds with little potential for biological activity, bioavailability, or other drug-like characteristics are easily identified as outliers. We have compiled scaling values for 314 molecular descriptors. In addition, the 10th and 90th percentile values for each descriptor have been calculated for use in outlier filtering.
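As a rough illustration of percentile-based scaling and outlier filtering for a single descriptor, consider the sketch below. It rests on assumptions: the abstract does not give the form of the scaling functions, so a simple linear map that sends the reference 10th percentile to 0 and the 90th percentile to 1 is used here, with linearly interpolated percentiles. The names `percentile`, `DescriptorScale`, and `fit_scale` are hypothetical, and the reference values stand in for descriptor values drawn from a reference database.

```python
from dataclasses import dataclass
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one reference value")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


@dataclass(frozen=True)
class DescriptorScale:
    """Reference 10th and 90th percentile values for one descriptor."""
    p10: float
    p90: float

    def scale(self, x: float) -> float:
        # Assumed scaling function: p10 maps to 0.0 and p90 maps to 1.0.
        if self.p90 == self.p10:
            raise ValueError("degenerate reference distribution")
        return (x - self.p10) / (self.p90 - self.p10)

    def is_outlier(self, x: float) -> bool:
        # Assumed filter: anything outside the [p10, p90] window is flagged.
        return x < self.p10 or x > self.p90


def fit_scale(reference_values: Sequence[float]) -> DescriptorScale:
    """Derive the scale for one descriptor from reference-database values."""
    return DescriptorScale(
        p10=percentile(reference_values, 10.0),
        p90=percentile(reference_values, 90.0),
    )
```

Fitting one `DescriptorScale` per descriptor on a shared reference set puts descriptors with very different native ranges on a common footing, which is the property the abstract relies on for comparisons across data sets.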