Minimum Sum-Squared Residue Co-clustering of Gene Expression Data

Cho, Hyuk; Dhillon, Inderjit S.; Guan, Yuqiang; Sra, Suvrit

doi:10.1137/1.9781611972740.11

Cited by 232 publications

(233 citation statements)

References 20 publications

Mentioning: 229

“…Co-clustering has been studied in many different application contexts including text mining [11], gene expression analysis [8,27] and graph mining [5] where these methods have yielded an impressive improvement in performance over traditional clustering techniques. The methods differ primarily by the criterion they optimize, such as minimum loss in mutual information [11], sum-squared distance [8], minimum description length (MDL) [5], Bregman divergence [2] and non-parametric association measures [29,17].…”

Section: Related Work (mentioning)

confidence: 99%

“…The methods differ primarily by the criterion they optimize, such as minimum loss in mutual information [11], sum-squared distance [8], minimum description length (MDL) [5], Bregman divergence [2] and non-parametric association measures [29,17]. Among these approaches, only those ones based on MDL and association measure are claimed to be parameter-free [19].…”

Section: Related Work (mentioning)

confidence: 99%


Parameter-less co-clustering for star-structured heterogeneous data

Ienco, Robardet, Pensa, et al. 2012. Data Min Knowl Disc


The availability of data represented with multiple features coming from heterogeneous domains is getting more and more common in real world applications. Such data represent objects of a certain type, connected to other types of data, the features, so that the overall data schema forms a star structure of inter-relationships. Co-clustering these data involves the specification of many parameters, such as the number of clusters for the object dimension and for all the feature domains. In this paper we present a novel co-clustering algorithm for heterogeneous star-structured data that is parameter-less. This means that it does not require either the number of row clusters or the number of column clusters for the given feature spaces. Our approach optimizes Goodman-Kruskal's τ, a measure for cross-association in contingency tables that evaluates the strength of the relationship between two categorical variables. We extend τ to evaluate co-clustering solutions and in particular we apply it in a higher dimensional setting. We propose the algorithm CoStar which optimizes τ by a local search approach. We assess the performance of CoStar on publicly available datasets from the textual and image domains using objective external criteria. The results show that our approach outperforms state-of-the-art methods for the co-clustering of heterogeneous data, while it remains computationally efficient.
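To make the measure named in this abstract concrete, the following is a minimal sketch of the standard Goodman-Kruskal τ for a single two-way contingency table. It is not the CoStar algorithm and does not cover the paper's extension of τ to co-clustering or to higher-dimensional data. The function name `goodman_kruskal_tau` and the list-of-rows table layout are assumptions made for illustration.

```python
def goodman_kruskal_tau(table):
    """Goodman-Kruskal tau for predicting the column variable from the row variable.

    `table` is a list of rows of non-negative counts (hypothetical layout).
    tau = (E_marginal - E_conditional) / E_marginal, where each E is the
    expected error of proportional guessing of the column category.
    """
    total = sum(sum(row) for row in table)
    if total == 0:
        raise ValueError("empty contingency table")

    n_cols = len(table[0])
    col_p = [sum(row[j] for row in table) / total for j in range(n_cols)]

    # Error when the column category is guessed from the column marginals alone.
    base_error = 1.0 - sum(p * p for p in col_p)
    if base_error == 0:
        raise ValueError("column variable is constant; tau is undefined")

    # Error when the row category is known: 1 - sum_ij n_ij^2 / (n_i. * N).
    cond = 0.0
    for row in table:
        row_total = sum(row)
        if row_total > 0:
            cond += sum(c * c for c in row) / (row_total * total)
    cond_error = 1.0 - cond

    return (base_error - cond_error) / base_error


if __name__ == "__main__":
    print(goodman_kruskal_tau([[5, 0], [0, 5]]))  # rows fully determine columns
    print(goodman_kruskal_tau([[2, 2], [2, 2]]))  # rows carry no information
```

The measure is asymmetric: swapping the roles of rows and columns generally gives a different value, so a co-clustering criterion built on it has to state which direction (or which combination) it uses.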


“…Similar ideas have been explored for gene expression data in [4]. Here we illustrated in our clustering model.…”

Section: The Optimization Procedures (mentioning)

confidence: 88%

A Clustering Model Based on Matrix Approximation with Applications to Cluster System Log Files

Peng, 2005. Machine Learning: ECML 2005


Abstract. In system management applications, to perform automated analysis of the historical data across multiple components when problems occur, we need to cluster the log messages with disparate formats to automatically infer the common set of semantic situations and obtain a brief description for each situation. In this paper, we propose a clustering model where the problem of clustering is formulated as matrix approximations and the clustering objective is minimizing the approximation error between the original data matrix and the reconstructed matrix based on the cluster structures. The model explicitly characterizes the data and feature memberships and thus enables the descriptions of each cluster. We present a two-side spectral relaxation optimization procedure for the clustering model. We also establish the connections between our clustering model with existing approaches. Experimental results show the effectiveness of the proposed approach.
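The idea of scoring a clustering by the approximation error between the data matrix and a matrix reconstructed from the cluster structure can be illustrated with a small sketch. The reconstruction used here, replacing every entry with the mean of its (row cluster, column cluster) block, is an assumption chosen for simplicity and is not taken from the paper's model or its spectral relaxation procedure. The function name `block_mean_error` and its arguments are likewise hypothetical.

```python
def block_mean_error(matrix, row_labels, col_labels):
    """Squared Frobenius error between `matrix` and its block-mean reconstruction.

    Assumption: the reconstructed matrix replaces each entry by the mean of the
    block defined by its row cluster and column cluster.
    """
    sums = {}
    counts = {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1

    # Mean of every non-empty (row cluster, column cluster) block.
    means = {key: sums[key] / counts[key] for key in sums}

    error = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            diff = value - means[(row_labels[i], col_labels[j])]
            error += diff * diff
    return error


if __name__ == "__main__":
    data = [
        [1, 1, 5, 5],
        [1, 1, 5, 5],
        [9, 9, 3, 3],
    ]
    cols = [0, 0, 1, 1]
    print(block_mean_error(data, [0, 0, 1], cols))  # labels match the block structure
    print(block_mean_error(data, [0, 1, 1], cols))  # second row assigned to the wrong cluster
```

A clustering procedure under this kind of objective searches over the row and column labels for an assignment that makes this error small; the sketch only evaluates a given assignment.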


“…Various co-clustering algorithms have adopted different error functions, such as minimum mutual information [15], sum-squared distance [9], and code length [6]. A general co-clustering framework based on Bregman divergence [4] has been proposed, which covers the entire exponential family.…”

Section: Definitions and Overview (mentioning)

confidence: 99%

DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining

Papadimitriou, Sun, 2008. 2008 Eighth IEEE International Conference on Data Mining



Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry. In database terms, Map-Reduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components, as necessary. In this paper we describe our experiences and findings in applying Map-Reduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bio-informatics, and graph mining. We propose the Distributed Co-clustering (DisCo) framework, which introduces practical approaches for distributed data pre-processing and co-clustering. We develop DisCo using Hadoop, an open source Map-Reduce implementation. We show that DisCo can scale well and efficiently process and analyze extremely large datasets (up to several hundreds of gigabytes) on commodity hardware.
