2015
DOI: 10.1007/s11390-015-1522-5
Accelerating Iterative Big Data Computing Through MPI

Abstract: Current popular systems, Hadoop and Spark, cannot achieve satisfactory performance because of inefficient overlapping of computation and communication when running iterative big data applications. The pipeline of computing, data movement, and data management plays a key role in current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running a PageRank workload, and then propose an event-driven pipeline and in-memory shuffle design w…
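The overlap the abstract describes can be illustrated in miniature. The following is a hypothetical sketch, not the paper's DataMPI-based implementation: it uses Python threads and a queue in place of MPI, purely to show how computation on early partitions can begin while later partitions are still being moved, instead of waiting at a full shuffle barrier.

```python
import threading
import queue

def pipelined_sum_of_squares(partitions):
    """Overlap 'communication' (moving partitions through a queue)
    with 'computation' (squaring), instead of a full barrier."""
    q = queue.Queue()
    results = []

    def consumer():
        # Computation starts as soon as the first partition arrives,
        # not after all partitions have been shuffled.
        while True:
            part = q.get()
            if part is None:
                break
            results.append(sum(x * x for x in part))

    t = threading.Thread(target=consumer)
    t.start()
    for part in partitions:   # producer: emits partitions incrementally
        q.put(part)
    q.put(None)               # sentinel: no more data
    t.join()
    return sum(results)
```

In an MPI setting the queue would correspond to non-blocking sends/receives, with the event-driven pipeline dispatching compute work as each message completes.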

Cited by 15 publications (9 citation statements) | References 19 publications
“…Other works attempted to develop novel approaches to data-centric programming. For example, in [88] the authors proposed an event-driven pipeline and in-memory shuffle using DataMPI-Iteration, which overlapped computation and communication for iterative BDA computing and showed speedups of 9×–21× over Apache Hadoop and 2×–3× over Apache Spark on PageRank and K-means. Another approach for running data-centric applications on MPI beyond the MapReduce model was proposed in [89], where the authors presented a set of building blocks that provide scalable data-movement capability to computational scientists and visualization researchers for writing their own parallel analyses.…”
Section: Process-centric Computing Models: MPI and OpenMP
confidence: 99%
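The PageRank workload cited above is shuffle-bound: every iteration exchanges rank contributions across the graph before the next iteration can proceed. A toy single-process power-iteration sketch (illustrative only, not DataMPI-Iteration's distributed implementation) makes the per-iteration exchange visible:

```python
def pagerank(links, iters=20, d=0.85):
    """Toy power-iteration PageRank over an adjacency dict.
    The 'contrib' exchange below is the per-iteration step that
    becomes the shuffle phase in a distributed setting."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # 'shuffle': contributions sent from each page to its out-links
        contrib = {n: 0.0 for n in nodes}
        for n, outs in links.items():
            for m in outs:
                contrib[m] += rank[n] / len(outs)
        rank = {n: (1 - d) / len(nodes) + d * contrib[n] for n in nodes}
    return rank
```

Because this exchange happens once per iteration, keeping it in memory and overlapping it with computation (rather than materializing it to disk, as Hadoop does) compounds into the 9×–21× speedups reported.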
“…Michael et al. [37] presented Spark SQL as SQL-like data processing built on top of Spark Core. Spark SQL processes data on the RDD abstraction, which has many limitations stemming from its in-memory nature [12,13]. Spark SQL's reliance on main memory makes the infrastructure less cost-effective, because memory resources are much more expensive than disk, whereas Hadoop processes data on disk [38].…”
Section: Related Work
confidence: 99%
“…An RDD's data can be recomputed if it is lost in a failure, because the model avoids data replication [12]. HLQLs based on RDDs perform many iterative computations on the same data, which necessitates a lot of memory to keep the data resident [13]. Therefore, the use of main memory by RDD-based HLQLs raises the cost of the infrastructure, because memory resources are much more expensive than disk.…”
confidence: 99%
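The recompute-instead-of-replicate idea mentioned in this statement can be sketched in a few lines. The class below is a hypothetical minimal model of lineage-based recovery, not Spark's actual RDD API: a derived dataset records only its parent and transformation, so a lost partition is replayed from lineage rather than restored from a replica.

```python
class RDD:
    """Minimal sketch of lineage-based recovery: a partition is not
    replicated; if its cached value is lost, it is recomputed from
    its parent via the recorded transformation (its lineage)."""
    def __init__(self, data=None, parent=None, fn=None):
        self.parent, self.fn = parent, fn
        self._cache = data          # source RDDs hold their data directly

    def map(self, fn):
        # Record lineage only; no computation, no replication.
        return RDD(parent=self, fn=fn)

    def collect(self):
        if self._cache is None:     # lost (or never computed): replay lineage
            self._cache = [self.fn(x) for x in self.parent.collect()]
        return self._cache

base = RDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
doubled.collect()        # computes and caches the derived partition
doubled._cache = None    # simulate losing the partition in a failure
```

Calling `doubled.collect()` again after the simulated loss rebuilds the data from `base`, which is the trade the quoted text describes: cheap fault tolerance paid for with memory pressure and recomputation time.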
“…In (Lu et al., 2014; Lu and Liang, 2016) … In (Reyes-Ortiz et al., 2015) the authors compare Apache Spark performance against MPI/OpenMP on the KNN and Pegasos SVM machine-learning algorithms. The results showed that the MPI/OpenMP approach is still more than 10 times faster in running time; however, one should note that Spark has the advantage of caching, which the authors did not address in their paper.…”
Section: Related Work
confidence: 99%