CLoNe: automated clustering based on local density neighborhoods for application to biomolecular structural ensembles

Träger, Sylvain; Tamò, Giorgio E.; Aydin, Deniz; Fonti, Giulia; Audagnotto, Martina; Peraro, Matteo Dal

doi:10.1093/bioinformatics/btaa742

Cited by 9 publications

(18 citation statements)

References 60 publications


“…While the algorithm is fairly robust to cutoff choice, a list-position based cutoff may present issues with clusters of varying densities [35]. In order to include information from all data points, while minimizing user input, for all segment-based clustering trials, the kernel density estimator cutoff was set to the average distance to the ln(N)-th nearest neighbor, where N is the number of trajectory segments considered. This choice of cutoff was motivated by the idea that the number of nearest neighbors k(N) must adapt to the underlying data distribution as the number of samples N → ∞.…”

Section: Methods (mentioning)

confidence: 99%

“…In the original implementation, this value is set to a fixed (second) percentile of the sorted list of pairwise distances. While the algorithm is fairly robust to cutoff choice, a list-position based cutoff may present issues with clusters of varying densities. In order to include information from all data points, while minimizing user input, for all segment-based clustering trials, the kernel density estimator cutoff was set to the average distance to the ln(N)-th nearest neighbor, where N is the number of trajectory segments considered.…”

Section: Methods (mentioning)

confidence: 99%
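The cutoff rule quoted above (average distance to the ln(N)-th nearest neighbor) can be illustrated with a short sketch. It is not the citing authors' implementation: the function name `knn_cutoff`, the use of Euclidean distance, rounding ln(N) to the nearest integer, and clamping k to the range [1, N-1] are all assumptions made for illustration. It uses only the Python standard library and a brute-force neighbor search, so it is suitable for small inputs only.

```python
import math


def knn_cutoff(points):
    """Return (k, cutoff), where cutoff is the mean distance from each
    point to its k-th nearest neighbor and k is ln(N) rounded to an integer.

    Assumptions (not stated in the quoted text): Euclidean distance,
    nearest-integer rounding, and clamping k to [1, N - 1].
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    k = min(max(1, round(math.log(n))), n - 1)

    total = 0.0
    for i, p in enumerate(points):
        # Distances from point i to every other point, sorted ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        total += dists[k - 1]  # k-th nearest neighbor (1-based)
    return k, total / n


# Four evenly spaced points on a line: ln(4) is about 1.39, so k = 1,
# and every point has a nearest neighbor at distance 1.
print(knn_cutoff([(0.0,), (1.0,), (2.0,), (3.0,)]))  # (1, 1.0)
```

Because k grows only logarithmically with N, the neighborhood used for the cutoff stays local as the data set grows, which is the adaptivity the quoted passage refers to.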

“… [28–34] Limitations of this method previously identified by the scientific community include the need for the user to specify the cutoff distance for the kernel density estimator and the need for the user to visually inspect the generated decision graph and manually assign cluster centroids as well as quadratic memory complexity [35, 36]. The last issue in particular can make memory requirements for a typical MD data set balloon to hundreds of gigabytes, necessitating the use of expensive high-end hardware. This problem can be mitigated by recomputing the pairwise distances as needed, rather than storing them (which trades memory for computational complexity) or using local approximations for density estimation.…”

Section: Introduction (mentioning)

confidence: 99%

“… [36] Later implementations of the method may be run on large data sets on regular desktop machines [37]. Several other groups have also proposed extensions of the method that address the aforementioned shortcomings [35, 36]; however, none have, to our knowledge, entirely eliminated user input or reduced memory complexity without computational trade-off or the use of approximations.…”

Section: Introduction (mentioning)

confidence: 99%

“…This method relies on the observation that cluster centroids exhibit a relatively high local density compared to their neighbors and a large distance from any points of higher density. This method has proven competent at handling clusters of varying shapes, sizes, and densities and has already been applied effectively to MD data sets [28–34]. Limitations of this method previously identified by the scientific community include the need for the user to specify the cutoff distance for the kernel density estimator and the need for the user to visually inspect the generated decision graph and manually assign cluster centroids as well as quadratic memory complexity [35, 36]. The last issue in particular can make memory requirements for a typical MD data set balloon to hundreds of gigabytes, necessitating the use of expensive high-end hardware. This problem can be mitigated by recomputing the pairwise distances as needed, rather than storing them (which trades memory for computational complexity) or using local approximations for density estimation.…”

Section: Introduction (mentioning)

confidence: 99%
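The two quantities described in the statement above (local density, and distance to the nearest point of higher density) can be sketched as follows, with pairwise distances recomputed on demand instead of stored in an N×N matrix. This is a minimal illustration, not the method of any paper listed here: the Gaussian kernel, the function name `density_peak_scores`, and the convention of giving the densest point the maximum distance to any other point are assumptions. Memory use is O(N); the cost is that every distance is computed more than once.

```python
import math


def density_peak_scores(points, cutoff):
    """Return (rho, delta) for density-peak clustering without storing
    the pairwise distance matrix.

    rho[i]   : local density, here a Gaussian-kernel sum (assumed kernel).
    delta[i] : distance to the nearest point with strictly higher density;
               for a point with no denser neighbor, the largest distance
               to any other point (assumed convention).
    """
    n = len(points)

    # Pass 1: accumulate densities; each distance is used, then discarded.
    rho = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            w = math.exp(-((math.dist(points[i], points[j]) / cutoff) ** 2))
            rho[i] += w
            rho[j] += w

    # Pass 2: recompute distances to find the nearest denser point.
    delta = [0.0] * n
    for i in range(n):
        to_denser = [math.dist(points[i], points[j])
                     for j in range(n) if rho[j] > rho[i]]
        if to_denser:
            delta[i] = min(to_denser)
        else:
            delta[i] = max(math.dist(points[i], points[j]) for j in range(n))
    return rho, delta


# Two well-separated 1-D groups; centroid candidates combine high rho
# with high delta.
pts = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,), (5.3,), (5.4,)]
rho, delta = density_peak_scores(pts, cutoff=0.2)
ranked = sorted(range(len(pts)), key=lambda i: rho[i] * delta[i], reverse=True)
print(ranked[:2])  # indices of the two strongest centroid candidates
```

Ranking by the product of the two scores is one simple way to avoid manual inspection of the decision graph; how many top-ranked points to accept as centroids remains a user choice in this sketch.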


CATBOSS: Cluster Analysis of Trajectories Based on Segment Splitting

Damjanovic, Murphy, Lin (2021). J. Chem. Inf. Model.


Molecular dynamics (MD) simulations are an exceedingly and increasingly potent tool for molecular behavior prediction and analysis. However, the enormous wealth of data generated by these simulations can be difficult to process and render in a human-readable fashion. Cluster analysis is a commonly used way to partition data into structurally distinct states. We present a method that improves on the state of the art by taking advantage of the temporal information of MD trajectories to enable more accurate clustering at a lower memory cost. To date, cluster analysis of MD simulations has generally treated simulation snapshots as a mere collection of independent data points and attempted to separate them into different clusters based on structural similarity. This new method, cluster analysis of trajectories based on segment splitting (CATBOSS), applies density-peak-based clustering to classify trajectory segments learned by change detection. Applying the method to a synthetic toy model as well as four real-life data sets–trajectories of MD simulations of alanine dipeptide and valine dipeptide as well as two fast-folding proteins–we find CATBOSS to be robust and highly performant, yielding natural-looking cluster boundaries and greatly improving clustering resolution. As the classification of points into segments emphasizes density gaps in the data by grouping them close to the state means, CATBOSS applied to the valine dipeptide system is even able to account for a degree of freedom deliberately omitted from the input data set. We also demonstrate the potential utility of CATBOSS in distinguishing metastable states from transition segments as well as promising application to cases where there is little or no advance knowledge of intrinsic coordinates, making for a highly versatile analysis tool.


Synthesis of new non-natural l-glycosidic flavonoid derivatives and their evaluation as inhibitors of Trypanosoma cruzi ecto-nucleoside triphosphate diphosphohydrolase 1 (TcNTPDase1)

Ribeiro, de Moraes, Mariotini-Moura et al. (2023). Purinergic Signalling


Size-and-Shape Space Gaussian Mixture Models for Structural Clustering of Molecular Dynamics Trajectories

Klem, Hocky, McCullagh (2022). J. Chem. Theory Comput.


Determining the optimal number and identity of structural clusters from an ensemble of molecular configurations continues to be a challenge. Recent structural clustering methods have focused on the use of internal coordinates due to the innate rotational and translational invariance of these features. The vast number of possible internal coordinates necessitates a feature space supervision step to make clustering tractable but yields a protocol that can be system type-specific. Particle positions offer an appealing alternative to internal coordinates but suffer from a lack of rotational and translational invariance, as well as a perceived insensitivity to regions of structural dissimilarity. Here, we present a method, denoted shape-GMM, that overcomes the shortcomings of particle positions using a weighted maximum likelihood alignment procedure. This alignment strategy is then built into an expectation maximization Gaussian mixture model (GMM) procedure to capture metastable states in the free-energy landscape. The resulting algorithm distinguishes between a variety of different structures, including those indistinguishable by root-mean-square displacement and pairwise distances, as demonstrated on several model systems. Shape-GMM results on an extensive simulation of the fast-folding HP35 Nle/Nle mutant protein support a four-state folding/unfolding mechanism, which is consistent with previous experimental results and provides kinetic details comparable to previous state-of-the art clustering approaches, as measured by the VAMP-2 score. Currently, training of shape-GMMs is recommended for systems (or subsystems) that can be represented by ≲200 particles and ≲100k configurations to estimate high-dimensional covariance matrices and balance computational expense. Once a shape-GMM is trained, it can be used to predict the cluster identities of millions of configurations.
