Clustered subset selection and its applications on IT service metrics

Boutsidis, Christos; Sun, Jimeng; Anerousis, Nikos

doi:10.1145/1458082.1458162

Cited by 12 publications

(29 citation statements)

References 37 publications

“…Boutsidis et al [6] use clustering in the setting of column subset selection problem. Column subset selection problems are important in large scale computations when the data matrix A is streaming in a way that it is impossible or impractical to store it entirely.…”

Section: Related Work (mentioning)

confidence: 99%

Clustered low rank approximation of graphs in information science applications

Savas, Dhillon

2011

Proceedings of the 2011 SIAM International Conference on Data Mining

In this paper we present a fast and accurate procedure called clustered low rank matrix approximation for massive graphs. The procedure involves a fast clustering of the graph and then approximates each cluster separately using existing methods, e.g. the singular value decomposition, or stochastic algorithms. The cluster-wise approximations are then extended to approximate the entire graph. This approach has several benefits: (1) important community structure of the graph is preserved due to the clustering; (2) highly accurate low rank approximations are achieved; (3) the procedure is efficient both in terms of computational speed and memory usage; (4) better performance in problems from various applications compared to standard low rank approximation. Further, we generalize stochastic algorithms to the clustered low rank approximation framework and present theoretical bounds for the approximation error. Finally, a set of experiments, using large scale and real-world graphs, show that our methods outperform standard low rank matrix approximation algorithms.

Section: Related Work (mentioning)

confidence: 99%

“…2 Ω, where p is a small oversampling parameter (typically set to 5-10). Multiplying A with the random matrix Ω we obtain Y = AΩ.…”

Section: Introduction (mentioning)

confidence: 99%
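The sampling step quoted above can be sketched as follows. This is a minimal illustration, not code from the cited paper: it assumes Ω is an n × (k + p) standard Gaussian matrix (the quote is truncated before the dimensions are stated), and the function name `randomized_range` and the test matrix are hypothetical.

```python
import numpy as np

def randomized_range(A, k, p=5, seed=0):
    """Return an orthonormal basis Q that approximates the range of A.

    k is the target rank and p the oversampling parameter. The Gaussian
    test matrix and its n x (k + p) shape are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Omega = rng.standard_normal((n, k + p))  # random test matrix
    Y = A @ Omega                            # Y = A * Omega samples the range of A
    Q, _ = np.linalg.qr(Y)                   # orthonormalize the sampled columns
    return Q

# Hypothetical 50 x 40 test matrix of exact rank 3.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))

Q = randomized_range(A, k=3, p=5)
# Relative error of projecting A onto the span of Q.
rel_err = np.linalg.norm(A - Q @ (Q.T @ A)) / np.linalg.norm(A)
```

Because the test matrix has exact rank 3, the projection error should sit at the level of floating-point round-off; for matrices with slowly decaying singular values the error depends on k and p.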

“…5 Ω, where p is a small oversampling parameter (typically set to 5-10). Multiplying A with the random matrix Ω, we obtain Y = AΩ.…”

Section: Preliminaries (mentioning)

confidence: 99%

“…Necessary and sufficient conditions are given in order to reconstruct the truncated SVD of the original data matrix A from the truncated SVDs of its block-column-wise partitioning. Boutsidis, Sun, and Anerousis [8] use clustering for the column subset selection problem when dealing with streaming matrix data. In a similar approach with data sampling, Zhang and Kwok [44] present a clustered Nyström method for large scale manifold learning applications, where the authors approximate the eigenfunctions of associated integral equation kernels.…”

Section: Principal Angles Assume We Have a Truncated SVD Approximati… (mentioning)

confidence: 99%

Clustered Matrix Approximation

Savas, Dhillon

2016

SIAM J. Matrix Anal. & Appl.

Abstract. In this paper we develop a novel clustered matrix approximation framework, first showing the motivation behind our research. The proposed methods are particularly well suited for problems with large scale sparse matrices that represent graphs and/or bipartite graphs from information science applications. Our framework and resulting approximations have a number of benefits: (1) the approximations preserve important structure that is present in the original matrix; (2) the approximations contain both global-scale and local-scale information; (3) the procedure is efficient both in computational speed and memory usage; and (4) the resulting approximations are considerably more accurate with less memory usage than truncated SVD approximations, which are optimal with respect to rank. The framework is also quite flexible as it may be modified in various ways to fit the needs of a particular application. In the paper we also derive a probabilistic approach that uses randomness to compute a clustered matrix approximation within the developed framework. We further prove deterministic and probabilistic bounds of the resulting approximation error. Finally, in a series of experiments we evaluate, analyze, and discuss various aspects of the proposed framework. In particular, all the benefits we claim for the clustered matrix approximation are clearly illustrated using real-world and large scale data.

Greedy column subset selection for large-scale data sets

Farahat et al. 2014

In today's information systems, the availability of massive amounts of data necessitates the development of fast and accurate algorithms to summarize these data and represent them in a succinct format. One crucial problem in big data analytics is the selection of representative instances from large and massively distributed data, which is formally known as the Column Subset Selection problem. The solution to this problem enables data analysts to understand the insights of the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. This paper presents a fast and accurate greedy algorithm for large-scale column subset selection. The algorithm minimizes an objective function, which measures the reconstruction error of the data matrix based on the subset of selected columns. The paper first presents a centralized greedy algorithm for column subset selection, which depends on a novel recursive formula for calculating the reconstruction error of the data matrix. The paper then presents a MapReduce algorithm, which selects a few representative columns from a matrix whose columns are massively distributed across several commodity machines. The algorithm first learns a concise representation of all columns using random projection, and it then solves a generalized column subset selection problem at each machine in which a subset of columns are selected from the sub-matrix on that machine such that the reconstruction error of the concise representation is minimized. The paper demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets. (Footnote: A preliminary version of this paper appeared as [26]. This work was completed while the second author was at the University of Waterloo.)
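The greedy objective described in this abstract can be illustrated with a naive sketch. It is not the paper's algorithm: it recomputes the residual explicitly at every step instead of using the recursive formula the abstract mentions, it omits the random-projection and MapReduce stages, and the name `greedy_css` and the test matrix are hypothetical.

```python
import numpy as np

def greedy_css(A, k, tol=1e-12):
    """Greedily pick up to k columns of A that reduce reconstruction error.

    At each step the column whose removal from the residual E gives the
    largest drop in ||E||_F^2 is chosen; that drop equals
    ||E^T e_j||^2 / ||e_j||^2 for residual column e_j.
    """
    E = np.array(A, dtype=float)          # residual, starts as a copy of A
    selected = []
    for _ in range(k):
        norms = np.sum(E * E, axis=0)     # squared norm of each residual column
        if norms.max() <= tol:            # nothing left to explain
            break
        G = E.T @ E                       # column j of G is E^T e_j
        gains = np.sum(G * G, axis=0) / np.maximum(norms, tol)
        gains[norms <= tol] = 0.0         # ignore columns already explained
        j = int(np.argmax(gains))
        selected.append(j)
        q = E[:, j] / np.sqrt(norms[j])   # unit vector along the chosen residual column
        E -= np.outer(q, q @ E)           # project that direction out of the residual
    return selected, E

# Hypothetical 20 x 6 matrix of exact rank 2: two columns should suffice.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 6))
cols, residual = greedy_css(A, k=2)
rel_err = np.linalg.norm(residual) / np.linalg.norm(A)
```

Forming the full Gram matrix at every step is quadratic in the number of columns, so this version only suits small matrices.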
