Simulating Data to Study Performance of Finite Mixture Modeling and Clustering Algorithms

Maitra, Ranjan; Melnykov, Volodymyr

doi:10.1198/jcgs.2009.08054

Cited by 128 publications

(128 citation statements)

References 31 publications

Citation statement types: Supporting, Mentioning (125), Contrasting, Unclassified

“…The key result of Maitra and Melnykov (2010) is a closed expression for the probability of overlapping $\omega_{j|i}$ defined in Eq. (9), which is shown to be (for multivariate Gaussian mixtures) the cumulative distribution function (cdf) of a linear combination of non-central $\chi^2$ distributions $U_l$ with 1 degree of freedom plus a linear combination of $W_l \sim N(0, 1)$ random variables:…”

Section: Simulating Regression Mixture Data With MixSimReg (mentioning)

confidence: 99%

“…The approach, known as MixSim (Maitra and Melnykov 2010; Melnykov et al. 2012), was originally introduced in the multivariate context to generate samples from Gaussian mixture models $\sum_{g=1}^{G} \pi_g \phi(y; \mu_g, \Sigma_g)$ defined in a $v$-variate space, for given data vector $y$, group occurrence probabilities (or mixing proportions) $\pi_g$, group centroids $\mu_g$ and group covariance matrices $\Sigma_g$. If $i$ and $j$ ($i \neq j = 1, \ldots, G$) are clusters indexed by $\phi(y; \mu_i, \Sigma_i)$ and $\phi(y; \mu_j, \Sigma_j)$ with occurrence probabilities $\pi_i$ and $\pi_j$, then the misclassification probability with respect to cluster $i$ (i.e.…”

Section: Simulating Regression Mixture Data With MixSimReg (mentioning)

confidence: 99%

“…In order to control precisely the degree of overlap between the different regression hyperplanes of the generating mixture, we have extended MixSim to clusterwise regression. MixSim is a general, flexible and mathematically well founded framework originally introduced to generate mixtures of Gaussian distributions (Maitra and Melnykov 2010; Melnykov et al. 2012). We have implemented the new simulation framework, MixSimReg, in MATLAB and made it available in the FSDA toolbox together with a previous implementation of the original multivariate counterpart, already presented in .…”

Section: Introduction (mentioning)

confidence: 99%
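
The citation statements above describe MixSim as a generator of samples from a Gaussian mixture with mixing proportions $\pi_g$, centroids $\mu_g$ and covariance matrices $\Sigma_g$. As a minimal illustration of that sampling step only, the Python sketch below draws labelled points from a user-specified mixture. It is not the MixSim or FSDA implementation: the function name `sample_gaussian_mixture` and its parameters are hypothetical, and the step in which MixSim chooses parameters to reach a target overlap is not shown.

```python
import numpy as np


def sample_gaussian_mixture(n, weights, means, covs, seed=0):
    """Draw n points from sum_g weights[g] * N(means[g], covs[g]).

    Hypothetical helper for illustration; returns the points and their
    true component labels.
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)

    # Choose a component for every observation using the mixing proportions.
    labels = rng.choice(len(weights), size=n, p=weights)

    # Fill in the observations component by component.
    y = np.empty((n, means.shape[1]))
    for g in range(len(weights)):
        idx = labels == g
        if idx.any():
            y[idx] = rng.multivariate_normal(means[g], covs[g], size=int(idx.sum()))
    return y, labels
```

For example, `sample_gaussian_mixture(500, [0.3, 0.7], [[0, 0], [5, 5]], [np.eye(2), np.eye(2)])` returns a 500 × 2 array together with the component label of each row, which is the ground truth needed when scoring a clustering algorithm.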

Assessing trimming methodologies for clustering linear regression data

Torti, Perrotta, Riani, et al. 2018. Adv Data Anal Classif

We assess the performance of state-of-the-art robust clustering tools for regression structures under a variety of different data configurations. We focus on two methodologies that use trimming and restrictions on group scatters as their main ingredients. We also give particular care to the data generation process through the development of a flexible simulation tool for mixtures of regressions, where the user can control the degree of overlap between the groups. Level of trimming and restriction factors are input parameters for which appropriate tuning is required. Since we find that incorrect specification of the second-level trimming in the Trimmed CLUSTering REGression model (TCLUST-REG) can deteriorate the performance of the method, we propose an improvement where the second-level trimming is not fixed in advance but is data dependent. We then compare our adaptive version of TCLUST-REG with the Trimmed Cluster Weighted Restricted Model (TCWRM) which provides a powerful extension of the robust clusterwise regression methodology. Our overall conclusion is that the two methods perform comparably, but with notable differences due to the inherent degree of modeling implied by them.

“…The overlap characteristics of mixtures obtained from the generator [22] were controlled by the two parameters: $\bar{\omega}$ specifying average pairwise overlap between components and $\check{\omega}$ specifying maximum pairwise overlap. In the experiments, the number of components $K$ was fixed at 20 and mixtures with dimension $d \in \{2, 5, 10\}$ were generated.…”

Section: Synthetic Datasets (mentioning)

confidence: 99%

“…In the experiments with synthetic data, a generator recently proposed in [22] was employed which randomly generates Gaussian mixtures according to the user-defined overlap characteristics. The overlap $\omega_{ij}$ between two clusters $i$ and $j$ is defined as the sum of two misclassification probabilities $\omega_{j|i}$ and $\omega_{i|j}$ where:…”

Section: Synthetic Datasets (mentioning)

confidence: 99%
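
The statement above defines the overlap between two clusters as the sum of two misclassification probabilities. A minimal sketch of that quantity follows, assuming the definition used by Maitra and Melnykov: the misclassification probability of cluster $j$ given cluster $i$ is $\Pr[\pi_i \phi(X; \mu_i, \Sigma_i) < \pi_j \phi(X; \mu_j, \Sigma_j)]$ for $X \sim N(\mu_i, \Sigma_i)$. The code estimates it by plain Monte Carlo rather than by the closed-form expression of the original paper, and all function names are hypothetical.

```python
import itertools

import numpy as np


def log_weighted_density(x, weight, mean, cov):
    """Log of weight * N(x; mean, cov), evaluated for each row of x."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance of every row from the component mean.
    maha = np.einsum("ni,ni->n", diff @ np.linalg.inv(cov), diff)
    d = x.shape[1]
    return np.log(weight) - 0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def misclassification_prob(i, j, weights, means, covs, n_draws=100_000, seed=0):
    """Monte Carlo estimate of the probability that a draw from component i
    has a larger weighted density under component j (assumed definition)."""
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(means[i], covs[i], size=n_draws)
    log_i = log_weighted_density(x, weights[i], means[i], covs[i])
    log_j = log_weighted_density(x, weights[j], means[j], covs[j])
    return float(np.mean(log_i < log_j))


def pairwise_overlap(i, j, weights, means, covs, **kwargs):
    """Overlap of clusters i and j: the sum of both misclassification probabilities."""
    return (misclassification_prob(i, j, weights, means, covs, **kwargs)
            + misclassification_prob(j, i, weights, means, covs, **kwargs))


def overlap_summary(weights, means, covs, **kwargs):
    """Average and maximum pairwise overlap over all pairs of components."""
    pairs = itertools.combinations(range(len(weights)), 2)
    values = [pairwise_overlap(i, j, weights, means, covs, **kwargs) for i, j in pairs]
    return float(np.mean(values)), float(np.max(values))
```

As a sanity check, two equally weighted components with identity covariance and means two units apart have a misclassification probability of $\Phi(-1) \approx 0.1587$ in each direction, so the estimate from `pairwise_overlap` should fall close to 0.317.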

A new random approach for initialization of the multiple restart EM algorithm for Gaussian model-based clustering

Kwedlo. 2015. Pattern Anal Applic

The paper proposes a new method for initialization of the multiple restart EM algorithm for Gaussian mixture model-based clustering. The method initializes randomly both the mean vector and covariance matrix of a mixture component. In particular, the mean vector is initialized by a feature vector selected deterministically from a random subset of candidate feature vectors. The selection criterion is the maximum Mahalanobis distance from the already initialized mixture component centers. The covariance matrix of a component is initialized by randomly generating its eigenvalues and eigenvectors. In computational experiments, the used approach was compared with three other random EM initialization methods. The experiments were performed on synthetic datasets generated from the Gaussian mixtures with the different overlap characteristics, as well as on four real-life datasets. The results on synthetic data indicate that, for well separated clusters, for which the maximum pairwise overlap is not excessively high, the described method yields clusterings which correspond better to the original partitions of data, as indicated by the adjusted Rand index. The experiments on real data indicate that the performance of the method is comparable to other three methods for two smaller datasets and significantly better for two larger datasets.

Keywords: Gaussian mixture models · EM algorithm initialization · Model-based clustering · Multiple restart EM

Network‐based semisupervised clustering

Frigau, Contu, Molà, et al. 2021. Appl Stoch Models Bus & Ind

Semisupervised clustering extends standard clustering methods to the semisupervised setting, in some cases considering situations when clusters are associated with a given outcome variable that acts as a “noisy surrogate,” that is a good proxy of the unknown clustering structure. In this article, a novel approach to semisupervised clustering associated with an outcome variable named network‐based semisupervised clustering (NeSSC) is introduced. It combines an initialization, a training and an agglomeration phase. In the initialization and training a matrix of pairwise affinity of the instances is estimated by a classifier. In the agglomeration phase the matrix of pairwise affinity is transformed into a complex network, in which a community detection algorithm searches the underlying community structure. Thus, a partition of the instances into clusters highly homogeneous in terms of the outcome is obtained. We consider a particular specification of NeSSC that uses classification or regression trees as classifiers and the Louvain, Label propagation and Walktrap as possible community detection algorithm. NeSSC's stopping criterion and the choice of the optimal partition of the original data are also discussed. Several applications on both real and simulated data are presented to demonstrate the effectiveness of the proposed semisupervised clustering method and the benefits it provides in terms of improved interpretability of results with respect to three alternative semisupervised clustering methods.
