SystemML: Declarative machine learning on MapReduce

Ghoting, Amol; Krishnamurthy, Rajasekar; Pednault, Edwin P. D.; Reinwald, Berthold; Sindhwani, Vikas; Tatikonda, Shirish; Tian, Yuanyuan; Vaithyanathan, Shivakumar

doi:10.1109/icde.2011.5767930

Cited by 259 publications

(207 citation statements)

References 12 publications

Supporting

Mentioning

202

Contrasting

Unclassified

“…From a performance perspective, Stonebraker et al [6] propose the reuse of carefully optimized external C++ libraries as user defined functions for linear algebra calculations, but they leave the problem with resource management and suitable data structures in this "hybrid" world yet unsolved. Another approach based on Hadoop is SystemML [1], where basic linear algebra primitives are addressable via a subset of the R language with a MapReduce backend. Few commercial data warehouse vendors already offer minor support for linear algebra operations integrated in the database engine, but to the best of our knowledge there is no solution which integrates transparent optimization based on topological features of the matrix (e.g., sparsity).…”

Section: Linear Algebra In Databases (mentioning)

confidence: 99%

“…Analytical algorithms in those fields are often composed of linear algebra operations, including matrix-matrix, matrix-vector and elementwise multiplications. Moreover, linear algebra operations form the building blocks of machine learning algorithms [1] used in data warehousing environments, which is a common domain for commercial databases.…”

Section: Introduction (mentioning)

confidence: 99%

Bringing Linear Algebra Objects to Life in a Column-Oriented In-Memory Database

Kernert, Köhler, Lehner (2015)

In Memory Data Management and Analysis

Abstract. Large numeric matrices and multidimensional data arrays appear in many science domains, as well as in applications of financial and business warehousing. Common applications include eigenvalue determination of large matrices, which decompose into a set of linear algebra operations. With the rise of in-memory databases it is now feasible to execute these complex analytical queries directly in the database without being restricted by hard disc latencies for random accesses. In this paper, we present a way to integrate linear algebra operations and large matrices as first class citizens into an in-memory database following a two-layered architectural model. The architecture consists of a logical component receiving manipulation statements and linear algebra expressions, and of a physical layer, which autonomously administrates multiple matrix storage representations. A cost-based hybrid storage representation is presented and an experimental implementation is evaluated for matrix-vector multiplications.

“…RHIPE [23] and SystemML [7] also aim to extend R for large-scale distributed computations. All the cited efforts to merge R with MapReduce suffer from two issues: (1) R is based on C and hence not native to Java-based frameworks such as Hadoop and Spark.…”

Section: Related Work (mentioning)

confidence: 99%

“…Data analysts aim to extract knowledge and to better understand their data; they do not wish to learn complex programming paradigms and new languages. Recent implementations attempt to provide high-level interfaces for data mining and associated algorithms which are compiled to low-level primitives [6], [7]. Such developments tend to require knowledge of the underlying distributed system, effectively shifting the focus from data mining to individual algorithm implementation.…”

Section: Introduction (mentioning)

confidence: 99%

A Parallel Distributed Weka Framework for Big Data Mining Using Spark

Koliopoulos, Yiapanis, Tekiner, et al. (2015)

2015 IEEE International Congress on Big Data

Abstract. Effective Big Data Mining requires scalable and efficient solutions that are also accessible to users of all levels of expertise. Despite this, many current efforts to provide effective knowledge extraction via large-scale Big Data Mining tools focus more on performance than on use and tuning, which are complex problems even for experts. Weka is a popular and comprehensive Data Mining workbench with a well-known and intuitive interface; nonetheless it supports only sequential single-node execution. Hence, the size of the datasets and processing tasks that Weka can handle within its existing environment is limited both by the amount of memory in a single node and by sequential execution. This work discusses DistributedWekaSpark, a distributed framework for Weka which maintains its existing user interface. The framework is implemented on top of Spark, a Hadoop-related distributed framework with fast in-memory processing capabilities and support for iterative computations. By combining Weka's usability and Spark's processing power, DistributedWekaSpark provides a usable prototype distributed Big Data Mining workbench that achieves near-linear scaling in executing various real-world scale workloads: 91.4% weak scaling efficiency on average and up to 4x faster on average than Hadoop.

“…Large scale graph mining using HADOOP has attracted significant interests [4,5,6] due to its simplicity, fault tolerance, and low maintenance costs, compared to graph mining based on MPI [7] and Bulk Synchronous Parallel model [8].…”

Section: Related Work (mentioning)

confidence: 99%

Pegasus: Mining billion-scale graphs in the cloud

Kang, Chau, Faloutsos (2012)

2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract. We have entered in an era of big data. Graphs are now measured in terabytes or even petabytes; analyzing them has become increasingly challenging. How do we find patterns and anomalies in these graphs that no longer fit in memory? How should we exploit parallel computation to boost our analysis capabilities? We present PEGASUS, the first open-source, peta-scale graph mining library, for the HADOOP platform (open-source implementation of MAPREDUCE). By observing that many graph mining operations can be described by repeated matrix-vector multiplications, we devised an important primitive called GIM-V for PEGASUS that applies to all such operations. GIM-V (Generalized Iterative Matrix-Vector multiplication) is highly optimized, achieving (1) good scale-up with the number of machines, (2) linear run time on the number of edges, and (3) more than 9 times faster performance over the non-optimized version. We ran experiments for PEGASUS on M45, one of the largest HADOOP clusters in the world. We report our findings on several real graphs with billions of nodes and edges. Selected findings include (a) the discovery of adult advertisers in the who-follows-whom on Twitter, and (b) the 7-degrees of separation in the Web graph.
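The abstract's central observation, that many graph mining operations reduce to repeated matrix-vector multiplication, can be illustrated with a small Python sketch. This is an illustration only and not PEGASUS code: it runs in memory on one machine rather than on HADOOP, and the names gim_v, combine2, combine_all, assign, and connected_components are assumptions chosen for this sketch.

```python
from collections import defaultdict


def gim_v(edges, vector, combine2, combine_all, assign):
    """One generalized matrix-vector step over a sparse edge list.

    edges:  iterable of (i, j, m_ij) entries of the matrix
    vector: dict mapping node id -> current value
    The three callbacks are hypothetical hooks for this sketch:
      combine2(m_ij, v_j)   pairs one matrix entry with one vector entry
      combine_all(values)   reduces all paired values of a row
      assign(old, new)      decides the next value for that row
    """
    partial = defaultdict(list)
    for i, j, m_ij in edges:
        partial[i].append(combine2(m_ij, vector[j]))
    return {
        i: assign(v_i, combine_all(partial.get(i, [])))
        for i, v_i in vector.items()
    }


def connected_components(undirected_edges, nodes):
    """Label every node with the smallest node id in its component."""
    # Store each undirected edge in both directions.
    edges = [(i, j, 1)
             for a, b in undirected_edges
             for i, j in ((a, b), (b, a))]
    labels = {n: n for n in nodes}
    while True:
        updated = gim_v(
            edges,
            labels,
            combine2=lambda m_ij, v_j: v_j,
            combine_all=lambda xs: min(xs) if xs else None,
            assign=lambda old, new: old if new is None else min(old, new),
        )
        if updated == labels:  # fixed point reached
            return labels
        labels = updated


if __name__ == "__main__":
    # Ordinary matrix-vector product: multiply, sum, overwrite.
    print(gim_v([(0, 0, 2), (0, 1, 1), (1, 1, 3)], {0: 1, 1: 2},
                lambda m, v: m * v, sum, lambda old, new: new))
    # Two components: {0, 1, 2} and {3, 4}.
    print(connected_components([(0, 1), (1, 2), (3, 4)], range(5)))
```

Swapping the three callbacks changes the operation without changing the iteration itself: multiply, sum, and overwrite give an ordinary matrix-vector product, while propagating the minimum label gives connected components.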
