An Experimental Comparison of Iterative MapReduce Frameworks

Lee, Haejoon; Kang, Minseo; Youn, Sun-Bum; Lee, Jae-Gil; Kwon, Young-Hyuk

doi:10.1145/2983323.2983647

Cited by 9 publications

(2 citation statements)

References 12 publications


“…On the other hand, Apache Spark uses read-only cached version of objects (resilient distributed dataset) which can be reused in parallel operations, thus reducing the performance overhead during iterative computation. Lee et al [48] evaluated five systems including Hadoop and Spark over various workloads to compare against four iterative algorithms. The experimentation was performed on Amazon EC2 cloud.…”

Section: Machine Learning and Iterative Tasks Support (mentioning)

confidence: 99%

Big Data in Cloud Computing: A Resource Management Perspective

Ullah

Awan

Khiyal

2018

Scientific Programming


The modern day advancement is increasingly digitizing our lives which has led to a rapid growth of data. Such multidimensional datasets are precious due to the potential of unearthing new knowledge and developing decision-making insights from them. Analyzing this huge amount of data from multiple sources can help organizations to plan for the future and anticipate changing market trends and customer requirements. While the Hadoop framework is a popular platform for processing larger datasets, there are a number of other computing infrastructures, available to use in various application domains. The primary focus of the study is how to classify major big data resource management systems in the context of cloud computing environment. We identify some key features which characterize big data frameworks as well as their associated challenges and issues. We use various evaluation metrics from different aspects to identify usage scenarios of these platforms. The study came up with some interesting findings which are in contradiction with the available literature on the Internet.


“…Unlike Hadoop, a widely used open-source implementation of MapReduce, RDD partitions are cached in memory or on disks of each worker in the cluster. Due to the in-memory caching, Spark shows a good performance for iterative computation [31,46] which is necessary for graph mining and machine learning tasks. However, Spark still requires disk I/O [38] since its typical operations with shuffling including join and groupBy operations need to access disks for external-sort.…”

Section: Mapreduce and Spark (mentioning)

confidence: 99%

PMV: Pre-partitioned Generalized Matrix-Vector Multiplication for Scalable Graph Mining

Park

Park

Yoon

et al. 2017

Preprint


How can we analyze enormous networks including the Web and social networks which have hundreds of billions of nodes and edges? Network analyses have been conducted by various graph mining methods including shortest path computation, PageRank, connected component computation, random walk with restart, etc. These graph mining methods can be expressed as generalized matrix-vector multiplication which consists of few operations inspired by typical matrix-vector multiplication. Recently, several graph processing systems based on matrix-vector multiplication or their own primitives have been proposed to deal with large graphs; however, they all have failed on Web-scale graphs due to insufficient memory space or the lack of consideration for I/O costs. In this paper, we propose PMV (Pre-partitioned generalized Matrix-Vector multiplication), a scalable distributed graph mining method based on generalized matrix-vector multiplication on distributed systems. PMV significantly decreases the communication cost, which is the main bottleneck of distributed systems, by partitioning the input graph in advance and judiciously applying execution strategies based on the density of the pre-partitioned submatrices. Experiments show that PMV succeeds in processing up to 16× larger graphs than existing distributed memory-based graph mining methods, and requires 9× less time than previous disk-based graph mining methods by reducing I/O costs significantly.


An experimental analysis of limitations of MapReduce for iterative algorithms on Spark

Kang

Lee

2017

Cluster Comput
