2016
DOI: 10.1007/978-3-319-44881-7_11

Fault Tolerance in MapReduce: A Survey

Cited by 14 publications (3 citation statements)
References 28 publications
“…Scientific applications are increasingly implemented to tolerate faults [3,26,20,13,17]. The three main techniques for implementing fault-tolerant algorithms are Algorithm-Based Fault-Tolerance [32,9], restarting failed sub-jobs [23], and checkpointing/restart [18,17]. Checkpointing libraries can save their checkpoint either to a (possibly network attached) disk or to the compute node's main memory ("diskless") [27].…”
Section: Related Work
confidence: 99%
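The statement above names checkpointing/restart as one of the three main fault-tolerance techniques: the application periodically saves its state to stable storage and, after a failure, resumes from the last saved state instead of starting over. A minimal sketch of that idea, using a hypothetical iterative job and an atomic on-disk checkpoint (all names here are illustrative, not from any cited library):

```python
import os
import pickle
import tempfile

# Hypothetical checkpoint location for this sketch.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "ckpt_demo.pkl")

def save_checkpoint(state, path=CHECKPOINT):
    # Write atomically: dump to a temp file, then rename over the old
    # checkpoint, so a crash mid-write never leaves a corrupt checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path=CHECKPOINT):
    # Return the last saved state, or None if no checkpoint exists yet.
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

def run_job(n_steps, checkpoint_every=10):
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    state = load_checkpoint() or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        state["total"] += state["step"]   # stand-in for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state)
    return state["total"]
```

A "diskless" variant, as mentioned in the quoted statement, would keep the serialized state in another node's memory rather than on (network-attached) disk; the save/load structure stays the same.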
“…Replication has been successfully employed and practiced to ensure high data availability in large-scale distributed storage systems [17,27,35]. Moreover, replication can be leveraged to improve data access performance under high load.…”
Section: Introduction
confidence: 99%
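The statement above points out that replication both keeps data available when nodes fail and lets reads be spread across replicas under high load. A toy sketch of both properties, with hypothetical placement and naming (not the scheme of any cited system):

```python
import random

class ReplicatedStore:
    """Toy key-value store: each value is written to k replica nodes,
    and reads succeed as long as any one replica is still alive."""

    def __init__(self, n_nodes=5, k=3):
        self.nodes = [dict() for _ in range(n_nodes)]  # each dict = one node
        self.alive = [True] * n_nodes
        self.k = k

    def _replicas(self, key):
        # Deterministic placement: hash the key, then take k consecutive nodes.
        start = hash(key) % len(self.nodes)
        return [(start + i) % len(self.nodes) for i in range(self.k)]

    def put(self, key, value):
        for n in self._replicas(key):
            self.nodes[n][key] = value

    def get(self, key):
        # Any live replica can answer; picking one at random also spreads
        # read load across replicas, as the quoted statement notes.
        live = [n for n in self._replicas(key) if self.alive[n]]
        if not live:
            raise KeyError(key)
        return self.nodes[random.choice(live)][key]
```

With k = 3, the store tolerates the loss of any two of a key's replicas before a read of that key can fail.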
“…The complexity, heterogeneity, dynamism and inherently distributed nature of Big Data technologies do not help in this regard either. Even models that adapt straightforwardly to Big Data computing environments (e.g., ensembles for predictive modeling) can be severely affected by the obsolescence of the information from which they are learned [3], or by the failure of a node in a distributed Map-Reduce computing grid [4]. All in all, data fusion, processing, learning and visualization of Big Data require a major focus not only on tailoring the algorithmic steps underlying each model/technique to the computing technologies underneath, but also on endowing them with higher levels of resilience against failures, adaptation to changes in data, and the accommodation of unprecedented levels of data volume, heterogeneity and veracity.…”
Section: Introduction
confidence: 99%