Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management (ISMM 2019)
DOI: 10.1145/3315573.3329988
Exploration of memory hybridization for RDD caching in Spark

Abstract: Apache Spark is a popular cluster computing framework for iterative analytics workloads due to its use of Resilient Distributed Datasets (RDDs) to cache data for in-memory processing. We have revealed that the performance of the Spark RDD cache can be severely limited if its capacity falls short of the needs of the workloads. In this paper, we have explored different memory hybridization strategies to leverage emergent Non-Volatile Memory (NVM) devices for Spark's RDD cache. We have found that a simple layered hyb…
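For background only, and not as the paper's hybridization scheme, the sketch below shows how an application pins an RDD in Spark's cache with a stock storage level, which is the mechanism whose capacity limits the abstract refers to. The application name, input path, and key layout are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Minimal sketch: cache an RDD so iterative jobs reuse it instead of
// recomputing it from the input. Paths and names here are hypothetical.
object RddCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-cache-sketch"))

    val ratings = sc.textFile("hdfs:///data/ratings.csv")   // hypothetical input path
      .map(_.split(","))
      .map(parts => (parts(0), parts(2).toDouble))

    // MEMORY_AND_DISK_SER lets evicted blocks spill to disk when the in-memory
    // cache is too small, rather than being dropped and recomputed later.
    ratings.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Each iteration reuses whichever cached blocks still fit in the RDD cache.
    for (_ <- 1 to 10) {
      val avg = ratings.mapValues(v => (v, 1L))
        .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
        .mapValues { case (s, c) => s / c }
      avg.count()   // action: triggers the job
    }

    sc.stop()
  }
}
```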

Cited by 3 publications (3 citation statements). References 24 publications (35 reference statements).
“…Resilient distributed dataset (RDD) [36] is a fault-tolerant parallel data structure, which is the core computational model in Spark. RDDs have two types of parallel operations [37]: transformations, which return a pointer to a new RDD, and actions, which return a value to the driver after running the computation.…”
Section: Spark Framework (mentioning)
confidence: 99%
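To make the transformation/action distinction in the statement above concrete, here is a minimal Scala sketch; the local master setting, application name, and sample data are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of the two operation types: transformations build new RDDs
// lazily, while an action runs the computation and returns a value to the driver.
object TransformationVsAction {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-ops-sketch").setMaster("local[*]"))

    val nums = sc.parallelize(1 to 1000)

    // Transformations: return new (lazily evaluated) RDDs; nothing runs yet.
    val squares = nums.map(n => n * n)
    val evens   = squares.filter(_ % 2 == 0)

    // Action: triggers the computation and returns a value to the driver program.
    val total = evens.reduce(_ + _)
    println(s"sum of even squares = $total")

    sc.stop()
  }
}
```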
“…Its specific steps are: in the first aggregation, assign a random number as a prefix to each key, perform aggregation operations such as reduceByKey on the prefixed data, then strip the random prefixes, and finally perform the full aggregation again. In Spark, RDDs adopt a lazy evaluation mechanism, and every time an action operation is encountered, the computation is performed from scratch [18][19]; that is, each call to an action triggers a computation from the beginning.…”
Section: B. RDD Operator Optimization and Persistence (mentioning)
confidence: 99%
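A minimal Scala sketch of the two-stage ("salted") aggregation described above, together with a persist call that keeps repeated actions from recomputing the lineage, might look as follows. The salt range, key names, and sample data are illustrative assumptions, not the cited paper's code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object SaltedAggregationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("salted-agg").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq(("hot", 1), ("hot", 1), ("cold", 1), ("hot", 1)))

    // Stage 1: prefix each key with a random number so a single hot key is split
    // across several partial keys, then aggregate the prefixed data.
    val salted  = pairs.map { case (k, v) => (s"${Random.nextInt(8)}#$k", v) }
    val partial = salted.reduceByKey(_ + _)

    // Stage 2: strip the random prefix and aggregate again for the final counts.
    val finalCounts = partial
      .map { case (saltedKey, v) => (saltedKey.split("#", 2)(1), v) }
      .reduceByKey(_ + _)

    // Persist the result; without this, every action below would recompute
    // the whole chain from the original input because of lazy evaluation.
    finalCounts.cache()

    println(finalCounts.collect().mkString(", "))  // first action: computes and caches
    println(finalCounts.count())                   // second action: served from the cache
    sc.stop()
  }
}
```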
“…However, the training data in the parallel random forest generation process requires multiple iterations, and a large number of RDD data blocks need to be reused across iterations until convergence is met. Spark's default least recently used (LRU) replacement algorithm cannot cope with our model's requirement for the reuse of RDD data blocks because it can easily swap high-reuse blocks out of the cache, causing inefficient job execution [34]. Based on these facts, a hierarchical cache replacement optimization for RDD objects is presented, which can effectively improve cluster execution efficiency during the process of building FS-DPRF.…”
Section: Parallelization on Spark (mentioning)
confidence: 99%
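The sketch below is not the cited hierarchical replacement policy; it only shows the stock mechanism available to a user for protecting a high-reuse RDD, namely choosing a storage level that spills evicted blocks to disk instead of letting LRU drop them and force recomputation in later iterations. The input path and the convergence loop are placeholder assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object HighReusePersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("reuse-persist").setMaster("local[*]"))

    // Imagine this RDD feeds every tree of a parallel random forest, so it is
    // touched on every iteration (high reuse).
    val trainingData = sc.textFile("hdfs:///data/train.libsvm")   // hypothetical path
      .map(_.trim)
      .filter(_.nonEmpty)

    // MEMORY_AND_DISK: blocks evicted from the memory cache are written to disk
    // rather than discarded, so later iterations reread them instead of
    // recomputing the whole lineage.
    trainingData.persist(StorageLevel.MEMORY_AND_DISK)

    // Placeholder iterate-until-convergence loop standing in for real training work.
    var metric = Double.MaxValue
    var iter = 0
    while (metric > 1e-3 && iter < 20) {
      metric = 1.0 / (trainingData.count() + iter + 1)
      iter += 1
    }

    sc.stop()
  }
}
```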