Darko Butina scite author profile

One of the most commonly used clustering algorithms within the worldwide pharmaceutical industry is Jarvis-Patrick's (J-P) (Jarvis, R. A. IEEE Trans. Comput. 1973, C-22, 1025-1034). The implementation of J-P under Daylight software, using Daylight's fingerprints and the Tanimoto similarity index, can deal with sets of 100 k molecules in a matter of a few hours. However, the J-P clustering algorithm has several associated problems which make it difficult to cluster large data sets in a consistent and timely manner. The clusters produced depend heavily on the choice of the two parameters needed to run J-P clustering, so that the method tends to produce clusters which are either very large and heterogeneous or homogeneous but too small. In any case, J-P always requires time-consuming manual tuning. This paper describes an algorithm which identifies dense clusters where similarity within each cluster reflects the Tanimoto value used for the clustering and, more importantly, where the cluster centroid is similar, at least at the given Tanimoto value, to every other molecule within the cluster. The algorithm does this in a consistent and automated manner. The similarity term used throughout this paper reflects the overall similarity between two given molecules, as defined by Daylight's fingerprints and the Tanimoto similarity index.

INTRODUCTION

Clustering²,³ has been described as 'the art of finding groups in data'⁴ and is widely used within the pharmaceutical industry to design different representative sets. The most common uses of representative sets are as training sets in the development of different structure-activity models and for screening in different biological screens. In both cases, one would assume that the cluster centroid is a good representative member of the corresponding cluster. It is therefore of great importance to be able to create homogeneous clusters in a consistent way and to deal with either small or very large sets equally well.

Our approach uses the desired similarity within the cluster, as defined by the Tanimoto index, as the only input to the clustering program.

METHODOLOGY

There are three key steps in this clustering approach:

1. generation of standard Daylight's fingerprints (ASCII);
2. identification of potential cluster centroids;
3. clustering based on the exclusion spheres.

1. Generation of Fingerprints. Fingerprints for each molecule are generated, using Daylight software, as an ASCII string of 1's and 0's (fixed width at 1024). See Appendix 1 for more details on the concept of Daylight's fingerprints.

2. Identifying Potential Cluster Centroids. It is reasonable to postulate that a molecule within a given cluster which has the largest number of neighbors, and is therefore 'most like' the rest of the cluster, is a good choice to become a cluster centroid. To identify such molecules, we calculate the number of neighbors for each molecule in the set, at the Tanimoto level chosen for the clustering. The set is then sorted in descending order, so that the potential cluster centroids, ...
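
The steps listed above can be illustrated with a minimal Python sketch. It is not the paper's implementation: fingerprints are plain '0'/'1' strings rather than output from Daylight software, and the function names `tanimoto` and `cluster_by_exclusion` are hypothetical. The excerpt is truncated before the exclusion-sphere step is described, so the assignment loop below is an assumption about how that step could work: each unassigned molecule, taken in descending order of neighbor count, becomes a centroid and claims its still-unassigned neighbors.

```python
from typing import List, Tuple


def tanimoto(a: str, b: str) -> float:
    """Tanimoto index of two equal-length '0'/'1' fingerprint strings."""
    both = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = a.count("1") + b.count("1") - both
    return both / either if either else 0.0


def cluster_by_exclusion(fps: List[str], threshold: float) -> List[Tuple[int, List[int]]]:
    """Return (centroid_index, member_indices) pairs.

    Assumed procedure: count neighbors at the threshold, sort by
    neighbor count (descending), then let each unassigned molecule
    claim its unassigned neighbors.
    """
    n = len(fps)
    # Step 2: neighbor list for every molecule at the chosen Tanimoto level.
    neighbors = [
        [j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= threshold]
        for i in range(n)
    ]
    # Molecules with the most neighbors are tried as centroids first.
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)

    # Step 3 (assumed): exclude already-clustered molecules from later clusters.
    assigned = [False] * n
    clusters: List[Tuple[int, List[int]]] = []
    for i in order:
        if assigned[i]:
            continue
        members = [j for j in neighbors[i] if not assigned[j]]
        assigned[i] = True
        for j in members:
            assigned[j] = True
        clusters.append((i, members))
    return clusters


# Toy 8-bit fingerprints (illustrative only; real ones are 1024 bits wide).
toy = ["11110000", "11100000", "11111000", "00001111", "00000111"]
result = cluster_by_exclusion(toy, threshold=0.7)
```

Because every member is taken from the centroid's own neighbor list, each member is similar to its centroid at or above the chosen threshold, which is the property the abstract emphasizes. The all-pairs neighbor count is quadratic in the number of molecules; this sketch makes no attempt to optimize it. Molecules with no neighbors end up as single-member clusters here, which is a choice made for the sketch rather than something stated in the excerpt.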


Estimation of Molecular Linear Free Energy Relation Descriptors Using a Group Contribution Approach

Platts

Butina

Abraham

et al. 1999

J. Chem. Inf. Comput. Sci.

420

300


Additive models for the estimation of Abraham's molecular descriptors R₂, π₂ᴴ, Σα₂ᴴ, Σβ₂ᴴ, Σβ₂ᴼ, and log L¹⁶ have been developed. For five of the six descriptors, one set of 81 atom and functional group fragments is capable of reproducing experimentally derived results with correlation coefficients ranging from 0.95 to 0.99. However, one descriptor, Σα₂ᴴ, required an entirely separate set of 51 fragments to be developed, resulting in a correlation coefficient of 0.97. Of particular importance is the speed of calculation (approximately 700 molecules/min), allowing so-called "high-throughput screening". Several applications of this model for molecules containing intramolecular interactions are discussed.


GR43175, a selective agonist for the 5‐HT₁‐like receptor in dog isolated saphenous vein

Humphrey

Feniuk

Perren

et al. 1988

British J Pharmacology

300

200


Correlation and prediction of a large blood–brain distribution data set—an LFER study

Platts

Abraham

Zhao

et al. 2001

European Journal of Medicinal Chemistry

191

160


Predicting ADME properties in silico: methods and models

Butina

Segall

Frankcombe

2002

Drug Discovery Today

233

143


Estimation of Molecular Linear Free Energy Relationship Descriptors by a Group Contribution Approach. 2. Prediction of Partition Coefficients

Platts

Abraham

Butina

et al. 1999

J. Chem. Inf. Comput. Sci.

156


A previously published method for the prediction of molecular linear free energy relationship descriptors is tested against experimentally determined partition coefficients in various solvent systems. Sets of partition data between water and octanol, cyclohexane, and chloroform were taken from the literature. For each set of partition data used, r² values ranged from 0.8 to 0.9 and RMS errors from 0.7 to 1.0 log unit, comparable to errors obtained with previously published models for octanol-water partition. Modified solvation equations for water-octanol and water-cyclohexane partition are presented, and their implications discussed. The possibility of applying the current approach to a wide range of solvation and transport properties is put forward.


Novel 2-D graphical representation of proteins

Randić

Butina

Zupan

2006

Chemical Physics Letters


A Rapid and Facile Method for the Dereplication of Purified Natural Products

et al. 2001


A new approach to the use of commercial databases for the dereplication of purified natural products has been developed. This is based on searching a text file that links each structure with its molecular weight and an exact count of the number of methyl, methylene, and methine groups it contains. Analysis of such a text file, constructed from a database containing more than 126,000 natural product structures, revealed that these data, readily measured using MS and NMR spectroscopy, are highly discriminating. The identification of an alkaloid and a sesquiterpene using this new approach is described.
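
The lookup idea in this abstract can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the tab-separated layout, the `Record` fields, the function names, the mass tolerance, and the sample entries (placeholder names and made-up numbers, not data from the paper or from any commercial database).

```python
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    name: str
    mol_weight: float
    n_methyl: int     # CH3 groups
    n_methylene: int  # CH2 groups
    n_methine: int    # CH groups


def parse_line(line: str) -> Record:
    """Parse one line of the assumed layout: name, MW, CH3, CH2, CH."""
    name, mw, ch3, ch2, ch = line.strip().split("\t")
    return Record(name, float(mw), int(ch3), int(ch2), int(ch))


def find_candidates(
    lines: Iterable[str],
    mol_weight: float,
    n_methyl: int,
    n_methylene: int,
    n_methine: int,
    mw_tolerance: float = 0.5,  # hypothetical tolerance, not from the source
) -> List[str]:
    """Names whose group counts match exactly and whose MW is within tolerance."""
    hits = []
    for line in lines:
        if not line.strip():
            continue
        rec = parse_line(line)
        counts_match = (rec.n_methyl, rec.n_methylene, rec.n_methine) == (
            n_methyl, n_methylene, n_methine)
        if counts_match and abs(rec.mol_weight - mol_weight) <= mw_tolerance:
            hits.append(rec.name)
    return hits


# Placeholder entries with invented values, for demonstration only.
SAMPLE_LINES = [
    "compound_A\t300.2\t3\t4\t5",
    "compound_B\t300.4\t3\t4\t6",
    "compound_C\t412.5\t3\t4\t5",
]
matches = find_candidates(SAMPLE_LINES, 300.0, 3, 4, 5)
```

With the sample entries above, only `compound_A` satisfies both the mass window and the exact group counts. In practice the observed values would come from MS and NMR measurements, as the abstract describes.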


Darko Butina

Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets
