2018 IEEE First International Conference on Artificial Intelligence and Knowledge Engineering (AIKE)
DOI: 10.1109/aike.2018.00045
MapReduce Tuning to Improve Distributed Machine Learning Performance

Cited by 9 publications (5 citation statements). References 4 publications.
“…A distributed real-time optimization method for MapReduce frameworks on emerging cloud platforms that support dynamic speed scaling is presented in [47]; it dynamically schedules input data of sufficient size and synthesizes intermediate processing results according to the state of the application and the data center, significantly improving throughput. It is shown in [48] how MapReduce parameters affect the distributed processing of machine learning programs supported by the Hadoop Mahout and Spark MLlib machine learning libraries: a virtualized cluster is built on Docker containers, and Hadoop parameters such as the number of replicas and the data block size are varied to measure DML performance.…”
Section: Return Results (mentioning, confidence: 99%)
“…It is another method for processing massive data that can efficiently partition and exploit large-scale resources. Jeon et al. [19] also proposed a Hadoop performance tuning method that reduces the amount of data transmitted over the network and minimizes disk I/O. Spam filtering methods are largely divided into reputation-based filtering methods and content-based filtering methods.…”
Section: Related Work (mentioning, confidence: 99%)
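The internals of the tuning method attributed to Jeon et al. are not given in this excerpt; only its two goals are named (less network traffic, less disk I/O). As a hedged illustration, two standard Hadoop MapReduce levers achieve exactly those goals: a combiner that pre-aggregates map output before the shuffle, and compression of intermediate map output. The SumReducer class below is a hypothetical example; the configuration keys are standard Hadoop properties.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class ShuffleTuning {

    // An associative, commutative sum reducer, so the same class can also
    // serve as a combiner: partial sums computed on each mapper shrink the
    // volume of data that crosses the network during the shuffle.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output: smaller spill files on local
        // disk and a smaller shuffle over the network.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "shuffle-tuned-job");
        job.setCombinerClass(SumReducer.class); // pre-aggregate before the shuffle
        job.setReducerClass(SumReducer.class);
        // ... mapper, key/value classes, and input/output paths as usual
    }
}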
“…Hence, the name node forwards jobs directly to a particular data node without knowledge of the entire cluster. Jeon et al. [25] show the effect of MapReduce parameters on the distributed processing of machine learning programs. Chung and Nah [26] showed how different virtualization methods affect the processing performance of distributed processing over a massive volume of data.…”
Section: Literature Review (mentioning, confidence: 99%)