1979
DOI: 10.2307/2346830

Algorithm AS 136: A K-Means Clustering Algorithm

Abstract: JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact support@jstor.org.


Cited by 9,902 publications (6,484 citation statements)
References 3 publications
“…The clustering algorithm we use is the K-means algorithm, as implemented by Hartigan and Wong. 29,30 A particularly salient feature of this algorithm, for our purposes, is that the computational expense scales linearly with the number of items (loops) to cluster. However, unlike clustering algorithms that calculate similarity between every pair of items (and thus scale quadratically), the number of clusters must be specified in advance for K-means.…”
Section: Clustering (mentioning)
confidence: 99%
“…In addition, each cluster is represented by a centre point with properties which are assumed to be representative of all catchments in the cluster. The k-means clustering algorithm (Hartigan and Wong, 1979) described in the next section is used to assign the catchments into nc clusters and also to calculate a centre vector of the spatial variables for each cluster. In addition, a standard deviation vector for each cluster can be calculated using the resulting centre vector and the vectors of the spatial data for all catchments in the cluster.…”
Section: New Neuro-fuzzy National P Export Model (mentioning)
confidence: 99%
“…Some defined distance measure such as the Euclidean distance is often used to determine proximity of the data in a cluster. The k-means clustering algorithm (Hartigan and Wong, 1979) is one of the simplest unsupervised learning algorithms for this partitioning when the number of clusters (k) is known a priori. The number of clusters is normally determined based on the amount and characteristics of the data which is used in calibrating the model.…”
Section: K-means Clustering Algorithm (mentioning)
confidence: 99%
“…K-mean clustering is a statistical method that partitions a given data set into a specific number of clusters in which each data belongs to the cluster with the nearest mean. For a more detailed algorithm, refer to (Hartigan and Wong 1979).…”
Section: K-mean Clustering Algorithm (mentioning)
confidence: 99%
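Several of the citing passages above make the same two points about K-means: the per-iteration cost is linear in the number of items, and the number of clusters k must be specified in advance. A minimal sketch of the generic (Lloyd-style) algorithm illustrates both properties. This is not the specific Hartigan-Wong (1979) AS 136 update scheme; the function name, parameters, and use of NumPy are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd-style K-means sketch (not Hartigan-Wong's AS 136
    update scheme): each iteration performs n * k distance evaluations,
    i.e. cost linear in the number of items, and k is fixed up front."""
    rng = np.random.default_rng(seed)
    # Initialise centres with k distinct points drawn from the data.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # Update step: move each centre to the mean of its cluster;
        # keep the old centre if a cluster ended up empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Unlike pairwise-similarity methods that scale quadratically, nothing here compares items to each other, only to the k centres, which is the scaling advantage the citing papers highlight.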