Clustering Categorical Data via Ensembling Dissimilarity Matrices

Amiri, Saeid; Clarke, Bertrand; Clarke, Jennifer

doi:10.1080/10618600.2017.1305278

Cited by 15 publications

(29 citation statements)

References 46 publications

“…In general, the clusters in different  ( ) 's do not correspond in a one-to-one manner according to their indices. For instances, (2) 's may be (1) 's with permuted subscript k. To further complicate matters, even permutation may not be a suitable relationship.…”

Section: Consider a Set Of Instances (mentioning)

confidence: 99%

“…1 Cluster alignment: Let  ( ) = { (1) 1 , (1) 2 , … , (1) }, = 1, … , . The number of clusters K_t varies.…”

Section: Consider a Set Of Instances (mentioning)

confidence: 99%

“…Even if K_t's are the same, permutation may not determine the right correspondence between the clusters. For instance, if (1) 1 is split into (2) 1 and (2) 2 , while (1) 2 and (1) 3 are merged into (2) 3 ,  (1) and  (2) still have the same number of clusters. To encode a general correspondence between clusters, we propose a soft cluster aligning matrix…”

Section: Consider a Set Of Instances (mentioning)

confidence: 99%
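
The split-and-merge situation in the statement above can be made concrete with a small sketch. This is only an illustration, not the cited paper's method: the weights here are plain row-normalised overlap proportions rather than the optimal transport solution the citing authors describe, and the function name `overlap_weights` and the toy labels are assumptions introduced for this example.

```python
from collections import Counter


def overlap_weights(labels_ref, labels_other):
    """Row-normalised overlap between the clusters of two partitions.

    weights[j][i] is the fraction of the points in cluster j of
    labels_other that fall in cluster i of labels_ref.
    NOTE: a simple overlap proportion, assumed for illustration only;
    it is not the optimal-transport alignment from the cited paper.
    """
    if len(labels_ref) != len(labels_other):
        raise ValueError("both partitions must label the same points")
    pair_counts = Counter(zip(labels_other, labels_ref))
    size_other = Counter(labels_other)
    ref_ids = sorted(set(labels_ref))
    other_ids = sorted(set(labels_other))
    return {
        j: {i: pair_counts[(j, i)] / size_other[j] for i in ref_ids}
        for j in other_ids
    }


# Toy data (hypothetical): eight points, three clusters in each partition.
# Reference cluster 1 is split into clusters 1 and 2 of the second
# partition, while reference clusters 2 and 3 are merged into cluster 3.
partition_1 = [1, 1, 1, 1, 2, 2, 3, 3]
partition_2 = [1, 1, 2, 2, 3, 3, 3, 3]

weights = overlap_weights(partition_1, partition_2)
for cluster_id, row in weights.items():
    print(cluster_id, row)
```

Both partitions have three clusters, yet two rows of the weight table point entirely at reference cluster 1 and the third row is shared between reference clusters 2 and 3, so no relabelling by permutation can match them one-to-one.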

“…We will present the method for solving W in the next section. 2 Cluster mapping matrix: Let us take  (1) as the reference partition and attempt to map each (2) ∈  (2) to clusters in  (1) . To mathematically describe the mapping, let i, j be a weight that indicates the extent (2) is mapped to (1) .…”

Section: Consider a Set Of Instances (mentioning)

confidence: 99%

“…One can easily check that P (2 → 1) = P (1) . In the case of permutation, Equation (1) is essentially to impose the inverse permutation.…”

Section: Consider a Set Of Instances (mentioning)

confidence: 99%

Optimal transport, mean partition, and uncertainty assessment in cluster analysis

Seo; Lin (2019), Statistical Analysis

In scientific data analysis, clusters identified computationally often substantiate existing hypotheses or motivate new ones. Yet the combinatorial nature of the clustering result, which is a partition rather than a set of parameters or a function, blurs the notions of mean and variance. This intrinsic difficulty hinders the development of methods to improve clustering by aggregation or to assess the uncertainty of the clusters generated. We overcome that barrier by aligning clusters via optimal transport. Equipped with this technique, we propose a new algorithm to enhance clustering by any baseline method using bootstrap samples. Cluster alignment enables us to quantify variation in the clustering result at the levels of both overall partitions and individual clusters. Set relationships between clusters such as one‐to‐one match, split, and merge can be revealed. A covering point set for each cluster, a concept akin to the confidence interval, is proposed. The tools we have developed here will help address the crucial question of whether any cluster is an intrinsic or spurious pattern. Experimental results on both simulated and real data sets are provided. The corresponding R package OTclust is available on CRAN.

Unsupervised and Semisupervised Learning

Pisztora (2021), Wiley StatsRef: Statistics Reference Online

Methodologies for unsupervised and semisupervised learning are reviewed. For unsupervised learning, or clustering, the focus is on mixture‐model‐based approaches under both the classic and mode association frameworks. High‐dimensional data pose a major challenge for clustering. We thus discuss in detail variable selection and the hidden Markov model on variable blocks, which exploits a graph structure to simplify the dependence among variables. We also present topics that emerged relatively recently such as clustering distributional data under the Wasserstein metric and uncertainty assessment for cluster analysis. Semisupervised learning has attracted growing interest in the machine learning community in recent years. We review foundational approaches including self‐training, semisupervised generative models, and graphical models. We then describe in greater depth entropy minimization, consistency regularization, and mixup augmentation, methods that are utilized in state‐of‐the‐art models such as MixMatch.

Search for relevant subsets of binary predictors in high dimensional regression for discovering the lead molecule

Mameli; Slanzi; Poli; et al. (2021), Pharmaceutical Statistics

One of the main problems confronting drug discovery research is identifying small molecules, modulators of protein function, that are likely to be therapeutically useful. Common practice relies on screening vast libraries of small molecules (often 1–2 million molecules) in order to identify a molecule, known as a lead molecule, that specifically inhibits or activates the protein function. To search for the lead molecule, we investigate the molecular structure, which generally consists of an extremely large number of fragments. The presence or absence of particular fragments, or groups of fragments, can strongly affect molecular properties. We study the relationship between a molecule's properties and its fragment composition by building a regression model in which the predictors, binary variables indicating the presence or absence of fragments, are grouped in subsets and a bi‐level penalization term is introduced to handle the high dimensionality of the problem. We evaluate the performance of this model in two simulation studies, comparing different penalization terms and different clustering techniques to derive the best structure of predictor subsets. Both studies are characterized by small data sets relative to the number of predictors under consideration. From the results of these simulation studies, we show that our approach can generate models that identify key features and provide accurate predictions. The good performance of these models is then demonstrated on real data about the MMP–12 enzyme.
