Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management (ISMM 2019)
DOI: 10.1145/3315573.3329988
Exploration of memory hybridization for RDD caching in Spark

Abstract: Apache Spark is a popular cluster computing framework for iterative analytics workloads due to its use of Resilient Distributed Datasets (RDDs) to cache data for in-memory processing. We have revealed that the performance of the Spark RDD cache can be severely limited if its capacity falls short of the needs of the workloads. In this paper, we have explored different memory hybridization strategies to leverage emergent Non-Volatile Memory (NVM) devices for Spark's RDD cache. We have found that a simple layered hyb…
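For background only, and not as the paper's hybridization scheme, the sketch below shows how an application pins an RDD in Spark's cache with a stock storage level, which is the mechanism whose capacity limits the abstract refers to. The application name, input path, and key layout are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Minimal sketch: cache an RDD so iterative jobs reuse it instead of
// recomputing it from the input. Paths and names here are hypothetical.
object RddCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-cache-sketch"))

    val ratings = sc.textFile("hdfs:///data/ratings.csv")   // hypothetical input path
      .map(_.split(","))
      .map(parts => (parts(0), parts(2).toDouble))

    // MEMORY_AND_DISK_SER lets evicted blocks spill to disk when the in-memory
    // cache is too small, rather than being dropped and recomputed later.
    ratings.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Each iteration reuses whichever cached blocks still fit in the RDD cache.
    for (_ <- 1 to 10) {
      val avg = ratings.mapValues(v => (v, 1L))
        .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
        .mapValues { case (s, c) => s / c }
      avg.count()   // action: triggers the job
    }

    sc.stop()
  }
}
```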

Cited by 3 publications (3 citation statements). References 24 publications (35 reference statements).
“…Resilient distributed dataset (RDD) [36] is a fault-tolerant parallel data structure, which is the core computational model in Spark. RDDs have two types of parallel operations [37]: transformations, which return a pointer to a new RDD, and actions, which return a value to the driver after running the computation.…”
Section: Spark Framework (mentioning)
confidence: 99%
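To make the transformation/action distinction in the statement above concrete, here is a minimal Scala sketch; the local master setting, application name, and sample data are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of the two operation types: transformations build new RDDs
// lazily, while an action runs the computation and returns a value to the driver.
object TransformationVsAction {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-ops-sketch").setMaster("local[*]"))

    val nums = sc.parallelize(1 to 1000)

    // Transformations: return new (lazily evaluated) RDDs; nothing runs yet.
    val squares = nums.map(n => n * n)
    val evens   = squares.filter(_ % 2 == 0)

    // Action: triggers the computation and returns a value to the driver program.
    val total = evens.reduce(_ + _)
    println(s"sum of even squares = $total")

    sc.stop()
  }
}
```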
“…Its specific steps are: in the first aggregation, assign a random number as a prefix to each key, perform aggregation operations such as reduceByKey on the prefixed data, then strip the random prefixes, and finally perform the full aggregation again. In Spark, RDDs adopt a lazy evaluation mechanism, and every time an action operation is encountered, the computation is performed from scratch [18][19]; that is, each call to an action triggers a computation from the beginning.…”
Section: B. RDD Operator Optimization and Persistence (mentioning)
confidence: 99%
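A minimal Scala sketch of the two-stage ("salted") aggregation described above, together with a persist call that keeps repeated actions from recomputing the lineage, might look as follows. The salt range, key names, and sample data are illustrative assumptions, not the cited paper's code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object SaltedAggregationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("salted-agg").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq(("hot", 1), ("hot", 1), ("cold", 1), ("hot", 1)))

    // Stage 1: prefix each key with a random number so a single hot key is split
    // across several partial keys, then aggregate the prefixed data.
    val salted  = pairs.map { case (k, v) => (s"${Random.nextInt(8)}#$k", v) }
    val partial = salted.reduceByKey(_ + _)

    // Stage 2: strip the random prefix and aggregate again for the final counts.
    val finalCounts = partial
      .map { case (saltedKey, v) => (saltedKey.split("#", 2)(1), v) }
      .reduceByKey(_ + _)

    // Persist the result; without this, every action below would recompute
    // the whole chain from the original input because of lazy evaluation.
    finalCounts.cache()

    println(finalCounts.collect().mkString(", "))  // first action: computes and caches
    println(finalCounts.count())                   // second action: served from the cache
    sc.stop()
  }
}
```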
“…However, the training data in the parallel random forest generation process requires multiple iterations, and a large number of RDD data blocks need to be reused across iterations until convergence is met. Spark's default least recently used (LRU) replacement algorithm cannot cope with our model's requirement for the reuse of RDD data blocks because it can easily swap high-reuse blocks out of the cache, causing inefficient job execution [34]. Based on these facts, a hierarchical cache replacement optimization for RDD objects is presented, which can effectively improve cluster execution efficiency during the process of building FS-DPRF.…”
Section: Parallelization on Spark (mentioning)
confidence: 99%
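The sketch below is not the cited hierarchical replacement policy; it only shows the stock mechanism available to a user for protecting a high-reuse RDD, namely choosing a storage level that spills evicted blocks to disk instead of letting LRU drop them and force recomputation in later iterations. The input path and the convergence loop are placeholder assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object HighReusePersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("reuse-persist").setMaster("local[*]"))

    // Imagine this RDD feeds every tree of a parallel random forest, so it is
    // touched on every iteration (high reuse).
    val trainingData = sc.textFile("hdfs:///data/train.libsvm")   // hypothetical path
      .map(_.trim)
      .filter(_.nonEmpty)

    // MEMORY_AND_DISK: blocks evicted from the memory cache are written to disk
    // rather than discarded, so later iterations reread them instead of
    // recomputing the whole lineage.
    trainingData.persist(StorageLevel.MEMORY_AND_DISK)

    // Placeholder iterate-until-convergence loop standing in for real training work.
    var metric = Double.MaxValue
    var iter = 0
    while (metric > 1e-3 && iter < 20) {
      metric = 1.0 / (trainingData.count() + iter + 1)
      iter += 1
    }

    sc.stop()
  }
}
```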