One of the most commonly used clustering algorithms within the worldwide pharmaceutical industry is Jarvis-Patrick's (J-P) (Jarvis, R. A. IEEE Trans. Comput. 1973, C-22, 1025-1034). The implementation of J-P under Daylight software, using Daylight's fingerprints and the Tanimoto similarity index, can deal with sets of 100 k molecules in a matter of a few hours. However, the J-P clustering algorithm has several associated problems which make it difficult to cluster large data sets in a consistent and timely manner. The clusters produced depend strongly on the choice of the two parameters needed to run J-P clustering, so that the method tends to produce clusters which are either very large and heterogeneous or homogeneous but too small. In any case, J-P always requires time-consuming manual tuning. This paper describes an algorithm which identifies dense clusters, in a consistent and automated manner, where similarity within each cluster reflects the Tanimoto value used for the clustering and, more importantly, where the cluster centroid is similar, at the given Tanimoto value or better, to every other molecule within the cluster. The similarity term used throughout this paper reflects the overall similarity between two given molecules, as defined by Daylight's fingerprints and the Tanimoto similarity index.
INTRODUCTION

Clustering 2,3 has been described as 'the art of finding groups in data' 4 and is widely used within the pharmaceutical industry to design different representative sets. The most common uses of representative sets are as training sets in the development of structure-activity models and as screening sets in biological screens. In both cases, one would assume that the cluster centroid is a good representative member of the corresponding cluster. It is therefore of great importance to be able to create homogeneous clusters in a consistent way and to deal with either small or very large sets equally well. Our approach uses the desired similarity within the cluster, as defined by the Tanimoto index, as the only input to the clustering program.
METHODOLOGY

There are three key steps in this clustering approach:

1. generation of standard Daylight fingerprints (ASCII);
2. identification of potential cluster centroids;
3. clustering based on the exclusion spheres.

1. Generation of Fingerprints. Fingerprints for each molecule are generated, using Daylight software, as an ASCII string of 1's and 0's (fixed width of 1024). See Appendix 1 for more details on the concept of Daylight's fingerprints.

2. Identifying Potential Cluster Centroids. It is reasonable to postulate that a molecule within a given cluster which has the largest number of neighbors, and is therefore 'most like' the rest of the cluster, is a good choice to become a cluster centroid. To identify such molecules, we calculate the number of neighbors for each molecule in the set, at the Tanimoto level chosen for the clustering. The set is then sorted in descending order of neighbor count, so that the potential cluster centroids, ...
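The three steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the description of step 3 is cut off in this excerpt, so the code assumes one common reading of exclusion-sphere clustering, in which the molecule at the top of the sorted list becomes a centroid, claims every still-unassigned neighbor at the chosen Tanimoto level, and those members are then excluded from further consideration. The function names (`tanimoto`, `neighbor_lists`, `exclusion_sphere_clusters`) are hypothetical, fingerprints are taken as ASCII strings of 1's and 0's as in step 1, and the all-pairs comparison is a plain O(n²) loop rather than anything from the Daylight toolkit.

```python
from typing import Dict, List


def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto index of two equal-length ASCII bit strings of '1'/'0'."""
    a, b = int(fp_a, 2), int(fp_b, 2)
    common = bin(a & b).count("1")   # bits set in both fingerprints
    union = bin(a | b).count("1")    # bits set in either fingerprint
    return common / union if union else 0.0


def neighbor_lists(fps: List[str], threshold: float) -> List[List[int]]:
    """For each molecule, list the indices of all others at or above threshold."""
    neighbors: List[List[int]] = [[] for _ in fps]
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            if tanimoto(fps[i], fps[j]) >= threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def exclusion_sphere_clusters(fps: List[str], threshold: float) -> Dict[int, List[int]]:
    """Map each centroid index to its members (centroid listed first).

    Assumption: a centroid claims all of its still-unassigned neighbors,
    and claimed molecules can no longer become centroids or members elsewhere.
    """
    neighbors = neighbor_lists(fps, threshold)
    # Step 2: candidates sorted by descending neighbor count (ties by index).
    order = sorted(range(len(fps)), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters: Dict[int, List[int]] = {}
    # Step 3 (assumed reading): walk the list, building exclusion spheres.
    for centroid in order:
        if centroid in assigned:
            continue
        members = [centroid] + [m for m in neighbors[centroid] if m not in assigned]
        assigned.update(members)
        clusters[centroid] = members
    return clusters
```

Calling `exclusion_sphere_clusters(fps, 0.7)` on a list of fingerprint strings returns a dictionary keyed by centroid index; a molecule with no neighbors at the chosen level ends up as a one-member cluster. Because every member is a direct neighbor of its centroid, the centroid is similar to each member at the chosen Tanimoto level or better, which is the property the abstract emphasizes.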
Additive models for the estimation of Abraham's molecular descriptors $R_2$, $\pi_2^H$, $\sum\alpha_2^H$, $\sum\beta_2^H$, $\sum\beta_2^O$, and $\log L^{16}$ have been developed. For five of the six descriptors, one set of 81 atom and functional group fragments is capable of reproducing experimentally derived results with correlation coefficients ranging from 0.95 to 0.99. However, one descriptor, $\sum\alpha_2^H$, required an entirely separate set of 51 fragments to be developed, resulting in a correlation coefficient of 0.97. Of particular importance is the speed of calculation (approximately 700 molecules/min), allowing so-called "high-throughput screening". Several applications of this model for molecules containing intramolecular interactions are discussed.
A previously published method for the prediction of molecular linear free energy relationship descriptors is tested against experimentally determined partition coefficients in various solvent systems. Sets of partition data between water and octanol, cyclohexane, and chloroform were taken from the literature. For each set of partition data used, r2 values ranged from 0.8 to 0.9 and RMS errors from 0.7 to 1.0 log unit, comparable to errors obtained with previously published models for octanol-water partition. Modified solvation equations for water-octanol and water-cyclohexane partition are presented, and their implications discussed. The possibility of applying the current approach to a wide range of solvation and transport properties is put forward.
A new approach to the use of commercial databases for the dereplication of purified natural products has been developed. This is based on searching a text file that links each structure with its molecular weight and an exact count of the number of methyl, methylene, and methine groups it contains. Analysis of such a text file, constructed from a database containing more than 126,000 natural product structures, revealed that these data, readily measured using MS and NMR spectroscopy, are highly discriminating. The identification of an alkaloid and a sesquiterpene using this new approach is described.