Clustering of multivariate binary data with dimension reduction via L1-regularized likelihood maximization

Yamamoto, Michio; Hayashi, Kenichi

doi:10.1016/j.patcog.2015.05.026

“…This can occur in pangenomics as the discovery rate of new families in the pangenome slightly decreases when new genomes are added. Mathematical solutions to this problem seem to exist [50][51][52] for example via the weighting of genomes (based on their respective contribution to the pangenome diversity) or via sparse partitioning methods. An improvement of NEM should include these solutions and could be a perspective of this work.…”

Section: Issues Resulting From High-dimensional Statistics and Parall (mentioning)

confidence: 99%

PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph

Gautreau, Bazin, Gachet et al. 2020


The use of comparative genomics for functional, evolutionary, and epidemiological studies requires methods to classify gene families in terms of occurrence in a given species. These methods usually lack multivariate statistical models to infer the partitions and the optimal number of classes and don't account for genome organization. We introduce a graph structure to model pangenomes in which nodes represent gene families and edges represent genomic neighborhood. Our method, named PPanGGOLiN, partitions nodes using an Expectation-Maximization algorithm based on multivariate Bernoulli Mixture Model coupled with a Markov Random Field. This approach takes into account the topology of the graph and the presence/absence of genes in pangenomes to classify gene families into persistent, cloud, and one or several shell partitions. By analyzing the partitioned pangenome graphs of isolate genomes from 439 species and metagenome-assembled genomes from 78 species, we demonstrate that our method is effective in estimating the persistent genome. Interestingly, it shows that the shell genome is a key element to understand genome dynamics, presumably because it reflects how genes present at intermediate frequencies drive adaptation of species, and its proportion in genomes is independent of genome size. The graph-based approach proposed by PPanGGOLiN is useful to depict the overall genomic diversity of thousands of strains in a compact structure and provides an effective basis for very large scale comparative genomics. The software is freely available at https://github.com/labgem/PPanGGOLiN. Published in PLOS Computational Biology.
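The Expectation-Maximization step named in the abstract above can be illustrated with a minimal sketch. It covers only a plain multivariate Bernoulli mixture on a presence/absence matrix and leaves out the Markov Random Field coupling entirely, so it is not PPanGGOLiN's implementation. The function name `fit_bernoulli_mixture`, the toy matrix, the smoothing constant, and the frequency-graded initialization are all assumptions made for illustration.

```python
import math


def fit_bernoulli_mixture(rows, n_classes, n_iter=50, eps=1e-6):
    """EM for a mixture of independent Bernoulli variables (hypothetical helper).

    rows: list of 0/1 lists, one per gene family, one column per genome.
    No graph or MRF term is included; this is the plain mixture only.
    """
    n_rows, n_features = len(rows), len(rows[0])
    weights = [1.0 / n_classes] * n_classes
    # Assumed deterministic start: class k begins at presence probability
    # (k + 1) / (n_classes + 1), so classes are graded from rare to frequent.
    probs = [[(k + 1) / (n_classes + 1)] * n_features for k in range(n_classes)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior class membership of each row, in log space.
        resp = []
        for row in rows:
            logp = []
            for k in range(n_classes):
                ll = math.log(weights[k])
                for x, p in zip(row, probs[k]):
                    ll += math.log(p) if x else math.log(1.0 - p)
                logp.append(ll)
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: smoothed re-estimates keep every probability inside (0, 1).
        for k in range(n_classes):
            mass = sum(r[k] for r in resp)
            weights[k] = (mass + eps) / (n_rows + n_classes * eps)
            for j in range(n_features):
                hits = sum(r[k] * row[j] for r, row in zip(resp, rows))
                probs[k][j] = (hits + eps) / (mass + 2 * eps)
    # Hard assignment: the class with the largest posterior for each row.
    labels = [max(range(n_classes), key=lambda k: r[k]) for r in resp]
    return weights, probs, labels


# Toy presence/absence matrix (made-up data): four families present in
# nearly every genome, four families present in at most one genome.
rows = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
weights, probs, labels = fit_bernoulli_mixture(rows, n_classes=2)
print(labels)
```

With two classes, the frequent families and the rare families end up with different labels, loosely mirroring a persistent/cloud split. Shell partitions, the choice of the number of classes, and the neighborhood smoothing described in the abstract would all need additional machinery.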


“…Actually, it can be the case in pangenomics as the number of new families added to the pangenome slightly decreases when new genomes are added (see figure 3 in [1]). Mathematical solutions to this issue seem to exist [46,47,48] for example via the weighting of features, corresponding to the weighting of genomes in our case. An improved version of NEM should include this improvement and could be perspective of this work.…”

Section: Issues Resulting From High-dimensional Statistics (mentioning)

confidence: 98%

PPanGGOLiN: depicting microbial diversity via a partitioned pangenome graph

Gautreau, Bazin, Gachet et al. 2019

Preprint


The use of comparative genomics for functional, evolutionary, and epidemiological studies requires methods to classify gene families in terms of occurrence in a given species. These methods usually lack multivariate statistical models to infer the partitions and the optimal number of classes and don't account for genome organization. We introduce a graph structure to model pangenomes in which nodes represent gene families and edges represent genomic neighborhood. Our method, named PPanGGOLiN, partitions nodes using an Expectation-Maximization algorithm based on multivariate Bernoulli Mixture Model coupled with a Markov Random Field. This approach takes into account the topology of the graph and the presence/absence of genes in pangenomes to classify gene families into persistent, cloud, and one or several shell partitions. By analyzing the partitioned pangenome graphs of isolate genomes from 439 species and metagenome-assembled genomes from 78 species, we demonstrate that our method is effective in estimating the persistent genome. Interestingly, it shows that the shell genome is a key element to understand genome dynamics, presumably because it reflects how genes present at intermediate frequencies drive adaptation of species, and its proportion in genomes is independent of genome size. The graph-based approach proposed by PPanGGOLiN is useful to depict the overall genomic diversity of thousands of strains in a compact structure and provides an effective basis for very large scale comparative genomics. The software is freely available at https://github.com/labgem/PPanGGOLiN.


“…Since many attributes are usually statistically irrelevant and independent of true categories, they may be removed or associated with small weights (Graham and Miller 2006;Bouguila 2010). This partially links mixture models with subspace clustering of discrete data (Yamamoto and Hayashi 2015;Chen et al 2016). Since the use of multinomial distributions formally requires an independence of attributes, different smoothing techniques were proposed, such as applying Dirichlet distributions as a prior to the multinomial (Bouguila and ElGuebaly 2009).…”

Section: Model-based Techniques (mentioning)

confidence: 99%

Efficient mixture model for clustering of sparse high dimensional binary data

Śmieja, Hajto, Tabor 2019

Data Min Knowl Disc


Clustering is one of the fundamental tools for preliminary analysis of data. While most of the clustering methods are designed for continuous data, sparse high-dimensional binary representations became very popular in various domains such as text mining or cheminformatics. The application of classical clustering tools to this type of data usually proves to be very inefficient, both in terms of computational complexity as well as in terms of the utility of the results. In this paper we propose a mixture model, SparseMix, for clustering of sparse high dimensional binary data, which connects model-based with centroid-based clustering. Every group is described by a representative and a probability distribution modeling dispersion from this representative. In contrast to classical mixture models based on the EM algorithm, SparseMix: is specially designed for the processing of sparse data; can be efficiently realized by an on-line Hartigan optimization algorithm; describes every cluster by the most representative vector. We have performed extensive experimental studies on various types of data, which confirmed that SparseMix builds partitions with a higher compatibility with reference grouping than related methods. Moreover, constructed representatives often better reveal the internal structure of data.



Cited by 17 publications

References 35 publications
