2022
DOI: 10.1021/acs.jcim.2c00265

iRaPCA and SOMoC: Development and Validation of Web Applications for New Approaches for the Clustering of Small Molecules

Abstract: The clustering of small molecules implies the organization of a group of chemical structures into smaller subgroups with similar features. Clustering has important applications to sample chemical datasets or libraries in a representative manner (e.g., to choose, from a virtual screening hit list, a chemically diverse subset of compounds to be submitted to experimental confirmation, or to split datasets into representative training and validation sets when implementing machine learning models). Most strategies …

citations
Cited by 13 publications
(15 citation statements)
references
References 37 publications
(50 reference statements)
“…The uniform manifold approximation and projection (UMAP) is a non-linear dimensionality reduction algorithm that seeks to learn the manifold structure of the data and find a low-dimensional embedding while preserving the essential topological structure of that manifold [29]. While UMAP has been used for dimensionality reduction [30], it has also been used for clustering [11]. UMAP has four basic parameters to control the impact on the resulting embedding.…”
Section: Materials and Methods (mentioning)
confidence: 99%
“…The clustering of biological entities, such as small molecules, can be performed using different approaches, including hierarchical clustering (HC), distribution-based clustering, and density-based clustering [10]. For example, several clustering algorithms, including hierarchical, Taylor–Butina, and UMAP clustering, have been compared on 29 data sets with between 100 and 5000 small molecules [11]. In addition, hierarchical clustering has been used to cluster molecules from the PubChem database [12], and Taylor–Butina clustering has been used to cluster molecules from the MolPort database [13].…”
Section: Introduction (mentioning)
confidence: 99%
“…The dimensionality was reduced by performing the PCA. The process is based on the principle of feature bagging (Prada Gori et al., 2022). The conventional feature extraction and data representation method used extensively in the fields of pattern recognition is principal component analysis (PCA), generally called as Karhunen-Loeve expansion.…”
Section: Methods (mentioning)
confidence: 99%
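
The statement above mentions feature bagging without detailing it. As a rough illustration only, feature bagging in the general sense means working on several randomly drawn subsets of the descriptor columns rather than on all columns at once. The sketch below shows just that subset-drawing step. The function names `draw_feature_subsets` and `project_columns`, the subset sizes, and the toy data are hypothetical and are not taken from the cited work; the PCA step applied afterwards is not shown.

```python
import random


def draw_feature_subsets(n_features, subset_size, n_subsets, seed=0):
    """Draw n_subsets random subsets of column indices, each without replacement."""
    rng = random.Random(seed)
    return [
        sorted(rng.sample(range(n_features), subset_size))
        for _ in range(n_subsets)
    ]


def project_columns(rows, columns):
    """Keep only the selected descriptor columns of each row."""
    return [[row[c] for c in columns] for row in rows]


# Toy descriptor table: 4 molecules x 6 descriptors (made-up values).
descriptors = [
    [0.1, 1.2, 3.4, 0.0, 2.2, 5.1],
    [0.3, 1.0, 3.1, 0.5, 2.0, 4.9],
    [2.1, 0.2, 0.4, 1.5, 0.1, 0.3],
    [2.4, 0.1, 0.6, 1.7, 0.2, 0.2],
]

subsets = draw_feature_subsets(n_features=6, subset_size=3, n_subsets=5)
reduced_tables = [project_columns(descriptors, cols) for cols in subsets]
```

Each entry of `reduced_tables` could then be passed to a dimensionality reduction and clustering step of one's choice.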
“…The molecules were clustered using Silhouette Optimized Molecular Clustering (SOMoC), an in-house clustering method. Briefly, SOMoC can be described as a sequential combination of molecular fingerprinting, dimensionality reduction using the Uniform Manifold Approximation and Projection (UMAP) algorithm, and clustering using the Gaussian Mixture Model (GMM) algorithm. Training (80%) and test (20%) partitions were then sampled in a stratified manner using the cluster assignments seeking to maintain similar distributions for the dependent variable (pIC50).…”
Section: Methods (mentioning)
confidence: 99%
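
The 80/20 partition described in the statement above can be sketched as a per-cluster random draw. This is a minimal illustration, assuming the cluster labels have already been computed by some clustering method. The function name `cluster_stratified_split` and the toy labels are hypothetical, and the sketch stratifies by cluster membership only; it does not check the pIC50 distribution that the citing authors also sought to preserve.

```python
import random
from collections import defaultdict


def cluster_stratified_split(cluster_labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), drawing test_fraction of each cluster for the test set."""
    rng = random.Random(seed)

    # Group molecule indices by their cluster label.
    by_cluster = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        by_cluster[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_cluster):
        members = by_cluster[label][:]
        rng.shuffle(members)
        # Number of test molecules taken from this cluster.
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy cluster assignments: three clusters of 10, 5 and 5 molecules.
labels = [0] * 10 + [1] * 5 + [2] * 5
train, test = cluster_stratified_split(labels)
```

Because the draw is made inside each cluster, every cluster contributes to both partitions in roughly the same 80/20 proportion, provided it is large enough for the rounding to yield at least one test molecule.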