Huansong Fu scite author profile

MapReduce is popular for big data analytics because it offers easy-to-use map and reduce user interfaces while hiding the complexity of system scalability and fault resiliency issues. While a large body of literature has focused on improving the performance and scalability of MapReduce, the issue of fault resiliency has thus far received little attention. In this paper, we investigate the fault resiliency of MapReduce using YARN (the next-generation Hadoop) as a case study. We reveal that the failure of a MapTask, a ReduceTask, or a compute node can have distinctly different impacts on MapReduce programs. In particular, YARN MapReduce is not able to gracefully handle failures that involve ReduceTasks, causing prolonged task execution, delayed job completion, and, more severely, failure amplification due to cascading effects on other tasks. Together, these problems cause the performance collapse of MapReduce jobs. We introduce a new fault-tolerant framework that can crack down on failure amplification and gracefully handle failure scenarios. It is designed with two key fault-handling techniques: analytics logging and speculative fast migration. Analytics logging is a lightweight mechanism that logs the key progress information of MapReduce tasks; speculative fast migration handles node failures by proactively re-executing MapTasks, migrating ReduceTasks, and performing collective merging with a pipeline of shuffle/merge and reduce stages. Our performance evaluation demonstrates that these techniques can eliminate failure amplification and deliver fast job execution compared to the existing task re-execution mechanism in MapReduce.


Multivariate modeling and two-level scheduling of analytic queries

Liu

Nath

Ding

et al. 2019

Parallel Computing


SHMEMGraph: Efficient and Balanced Graph Processing Using One-Sided Communication

Fu

Venkata

Salman

et al. 2018


FARMS: Efficient mapreduce speculation for failure recovery in short jobs

Chen

Zhu

et al. 2017

Parallel Computing


With the ever-increasing size of software and hardware components and the complexity of configurations, large-scale analytics systems face the challenge of frequent transient faults and permanent failures. As an indispensable part of big data analytics, the MapReduce programming model is equipped with a speculation mechanism to cope with run-time stragglers and failures. However, we reveal that the existing speculation mechanism has major drawbacks that hinder its efficiency during failure recovery, which we refer to as the speculation breakdown. We use YARN, the representative implementation of MapReduce, and its speculation mechanism as a case study to demonstrate that the speculation breakdown causes significant performance degradation among MapReduce jobs, especially those with shorter turnaround times. As our experiments show, a single node failure can slow a job down by up to 9.2 times. To address the speculation breakdown, we introduce a failure-aware speculation scheme and a refined task scheduling policy. Moreover, we have conducted a comprehensive set of experiments to evaluate the performance of both individual components and the whole framework. Our experimental results show that our new framework achieves dramatic performance improvement in handling node failures compared to the original YARN.



Huansong Fu

Entropy-Aware I/O Pipelining for Large-Scale Deep Learning on HPC Systems

Cracking Down MapReduce Failure Amplification through Analytics Logging and Migration
