2017
DOI: 10.1007/s10586-017-1167-y
An experimental analysis of limitations of MapReduce for iterative algorithms on Spark

Cited by 5 publications (2 citation statements)
References 17 publications
“…Given the disk-I/O-based operation and data-locality principles of MapReduce, we argue that any algorithm that involves intensive iteration, such as subtree generalization, can incur significant overheads at multiple points: disk I/O, network, and scheduling [40].…”
Section: Iteration (mentioning)
confidence: 99%
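The cited statement contrasts per-iteration disk I/O in chained MapReduce jobs with Spark's in-memory reuse. A minimal sketch of that pattern in Scala, assuming a hypothetical HDFS input path and a toy update rule; it only illustrates why caching removes the repeated disk reads the statement refers to:

import org.apache.spark.sql.SparkSession

object IterativeCachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("iterative-caching").getOrCreate()
    val sc = spark.sparkContext

    // Load once and cache: later iterations read in-memory partitions
    // instead of recomputing the lineage back to HDFS on every pass.
    val values = sc.textFile("hdfs:///data/values.txt") // hypothetical path
      .map(_.toDouble)
      .cache()
    val n = values.count() // first action materializes the cache

    var estimate = 0.0
    for (_ <- 1 to 10) {
      // Each iteration launches a new Spark job over the cached RDD; a
      // chained-MapReduce design would write to and re-read from disk
      // between every pass, and re-schedule each job from scratch.
      val error = values.map(v => v - estimate).reduce(_ + _) / n
      estimate += 0.5 * error // toy damped update toward the mean
    }
    println(s"estimate = $estimate")
    spark.stop()
  }
}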
“…Multiple Spark jobs initiated by different threads may run concurrently within a Spark application, and each application gets its own executor processes. Spark uses long-running executor processes, which stay up for the entire duration of the application and execute tasks in multiple threads, avoiding the overhead of repeatedly launching tasks [9,10]. The allocation of executor resources on the cluster can be controlled from the Spark YARN client with the --num-executors option, which overrides Spark's built-in dynamic resource allocation (DRA) mechanism [18].…”
Section: Spark Architecture and Resilient Distributed Dataset (RDD) (mentioning)
confidence: 99%
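The statement's point about fixed executor allocation can be expressed in application code as well as on the spark-submit command line. A minimal sketch, assuming a YARN deployment and illustrative resource values; setting spark.executor.instances is the configuration equivalent of --num-executors, and an explicit executor count takes precedence over dynamic resource allocation:

import org.apache.spark.sql.SparkSession

object StaticAllocationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("static-allocation")
      // Equivalent of `--num-executors 4` on spark-submit; pinning an
      // explicit executor count overrides dynamic resource allocation.
      .config("spark.executor.instances", "4")
      .config("spark.executor.cores", "2")   // illustrative values
      .config("spark.executor.memory", "2g")
      .getOrCreate()

    // These long-running executors stay up for the whole application;
    // jobs submitted from different threads share them and run their
    // tasks concurrently, as the cited statement describes.

    spark.stop()
  }
}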