Big data machine learning using Apache Spark MLlib

Assefi, Mehdi; Behravesh, Ehsun; Liu, Guangchi; Tafti, Ahmad P.

doi:10.1109/bigdata.2017.8258338

Cited by 80 publications

(43 citation statements)

References 36 publications

“…FlinkML library includes implementations of k-Means clustering algorithm, logistic regression, and Alternating Least Squares (ALS) for recommendation [11]. Spark has more efficient set of machine learning algorithms such as Spark MLlib [6] and MLI [51]. Spark MLlib is a scalable and fast library that is suitable for general needs and most areas of machine learning.…”

Section: Machine Learning Algorithms (mentioning)

confidence: 99%

An experimental survey on big data frameworks

Inoubli, Aridhi, Mezni, et al. 2018

Future Generation Computer Systems

104

Recently, increasingly large amounts of data are generated from a variety of sources. Existing data processing technologies are not suitable to cope with the huge amounts of generated data. Yet, many research works focus on Big Data, a buzzword referring to the processing of massive volumes of (unstructured) data. Recently proposed frameworks for Big Data applications help to store, analyze and process the data. In this paper, we discuss the challenges of Big Data and we survey existing Big Data frameworks. We also present an experimental evaluation and a comparative study of the most popular Big Data frameworks with several representative batch and iterative workloads. This survey is concluded with a presentation of best practices related to the use of studied frameworks in several application domains such as machine learning, graph processing and real-world applications.

“…As a result of this implementation, they proved that their new Smart-MLlib library scaled well than Spark's MLlib for each evaluation. Applying machine learning on a large and complex dataset requires a considerable number of physical resources to process this data, in [25], the authors explored Apache Spark MLlib version 2.0 as an open-source, distributed, scalable, and platform independent Machine Learning library, and they performed different real-world machine learning experiments to evaluate the qualitative and quantitative attributes of the platform. Alternating direction method of multipliers (ADMM) [26], it is a method used to solve a generic convex problem for most machine learning algorithms, this solution helps to transform the problem to an iterative system of linear equations, the authors implemented ADMM in Apache Spark and they compared this solution with MLlib then they showed that ADMM solution is like an alternative to MLlib for big-data problems, this approach has the added advantage of machine learning algorithms.…”

Section: Performance Evaluation Of Apache Spark Through Machine Learn (mentioning)

confidence: 99%

Leveraging resource management for efficient performance of Apache Spark

2019

Many applications generate and handle very large volumes of data, like social networking, cloud applications, public web sites, search engines, scientific simulations, data warehouse and so on. These large volumes of data are coming from a variety of sources and are both unstructured and structured data. In order to transform efficiently this massive data of various types into valuable information and meaningful knowledge, we need large-scale cluster infrastructures. In this context, one challenging problem is to realize an effective resource management of these large-scale cluster infrastructures in order to run distributed data analytics.

“…They found that the SVM is more accurate in the condition of total average. However, M. Assefi and et al, 2017 [22] explored some views for growing the form of the Apache Spark MLlib 2.0 as an open source, accessible and achieve many machine learning tests that related to the real world to inspect the attribute characteristics. Also presents a comparison among spark and Weka with proving the advantages of spark over the Weka in many sides like the performance and it is efficient dealing with a huge amount of data.…”

Section: Related Work (mentioning)

confidence: 99%

Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java

Omar, Jumaa 2019

KJAR

Nowadays with the technology revolution the term of big data is a phenomenon of the decade moreover, it has a significant impact on our applied science trends. Exploring well big data tool is a necessary demand presently. Hadoop is a good big data analyzing technology, but it is slow because the Job result among each phase must be stored before the following phase is started as well as to the replication delays. Apache Spark is another tool that developed and established to be the real model for analyzing big data with its innovative processing framework inside the memory and high-level programming libraries for machine learning, efficient data treating and etc. In this paper, some comparisons are presented about the time performance evaluation among Scala and Java in apache spark MLlib. Many tests have been done in supervised and unsupervised machine learning methods with utilizing big datasets. However, loading the datasets from Hadoop HDFS as well as to the local disk to identify the pros and cons of each manner and discovering perfect reading or loading dataset situation to reach best execution style. The results showed that the performance of Scala about 10% to 20% is better than Java depending on the algorithm type. The aim of the study is to analyze big data with more suitable programming languages and as consequences gaining better performance.
