Chronos: Failure-aware scheduling in shared Hadoop clusters

Yildiz, Orçun; Ibrahim, Shadi; Phuong, Tran Anh; Antoniu, Gabriel

doi:10.1109/bigdata.2015.7363770

Cited by 15 publications

(9 citation statements)

References 11 publications

(5 reference statements)


“…Relationship to previous work. This paper extends our previous contribution introduced in a previous paper [8] by providing more detailed descriptions and more thorough experiments. In particular, we have substantially extended two sections: While Section 2 gives an overview of MapReduce, Hadoop, scheduling in Hadoop and its fault-tolerance mechanism, Section 8 discusses related works on scheduling, failure recovery, task preemption and data-aware task scheduling in MapReduce.…”

Section: Introduction; classification: supporting

confidence: 58%

Enabling fast failure recovery in shared Hadoop clusters: Towards failure-aware scheduling

Yildiz, Ibrahim, Antoniu (2017), Future Generation Computer Systems

Self Cite

Hadoop emerged as the de facto state-of-the-art system for MapReduce-based data analytics. The reliability of Hadoop systems depends in part on how well they handle failures. Currently, Hadoop handles machine failures by re-executing all the tasks of the failed machines (i.e., executing recovery tasks). Unfortunately, this elegant solution is entirely entrusted to the core of Hadoop and hidden from Hadoop schedulers. The unawareness of failures therefore may prevent Hadoop schedulers from operating correctly towards meeting their objectives (e.g., fairness, job priority) and can significantly impact the performance of MapReduce applications. This paper presents Chronos, a failure-aware scheduling strategy that enables an early yet smart action for fast failure recovery while still operating within a specific scheduler objective. Upon failure detection, rather than waiting an uncertain amount of time to get resources for recovery tasks, Chronos leverages a lightweight preemption technique to carefully allocate these resources. In addition, Chronos considers data locality when scheduling recovery tasks to further improve the performance. We demonstrate the utility of Chronos by combining it with Fifo and Fair schedulers. The experimental results show that Chronos recovers to a correct scheduling behavior within a couple of seconds only and reduces the job completion times by up to 55% compared to state-of-the-art schedulers.
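The recovery step described in this abstract (take a slot for a recovery task right away, preempt if needed, prefer data-local nodes, stay within the scheduler's priority order) can be illustrated with a minimal sketch. This is not the Chronos implementation: the `Task` and `Slot` types, the `place_recovery_task` function, the "lower number means higher priority" convention, and the rule of trying free slots before preempting are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, List


@dataclass
class Task:
    task_id: str
    job_priority: int                       # assumption: lower number = higher priority
    data_nodes: frozenset = frozenset()     # nodes that hold this task's input data


@dataclass
class Slot:
    node: str
    running: Optional[Task] = None


def place_recovery_task(recovery: Task, slots: List[Slot]) -> Tuple[Optional[Slot], Optional[Task]]:
    """Assign `recovery` to a slot; return (chosen slot, preempted task or None).

    Hypothetical policy: use a free slot if one exists, otherwise preempt a
    task whose job has strictly lower priority than the recovery task's job.
    In both cases a data-local slot is preferred.
    """
    def is_local(slot: Slot) -> bool:
        return slot.node in recovery.data_nodes

    free = [s for s in slots if s.running is None]
    if free:
        chosen = max(free, key=is_local)    # True outranks False, so local slots win
        victim = None
    else:
        candidates = [s for s in slots
                      if s.running.job_priority > recovery.job_priority]
        if not candidates:
            return None, None               # nothing may be preempted under this policy
        # Prefer a data-local slot, then the lowest-priority running task.
        chosen = max(candidates,
                     key=lambda s: (is_local(s), s.running.job_priority))
        victim = chosen.running

    chosen.running = recovery
    return chosen, victim
```

For example, with three busy slots on nodes `n1` to `n3` and a recovery task whose input lives on `n3`, the sketch preempts the lower-priority task on `n3` rather than an equally low-priority task on a remote node. A real scheduler would also have to save or requeue the preempted task, which this sketch leaves out.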


“…Almost all the surveyed schedulers in this paper have advantages in terms of fairness and completion time compared to the default Hadoop scheduling policy. Interestingly, FRESH [10] and COSHH-hybrid [8] have the potential to become a native part of Hadoop, replacing FIFO and Fair sharing, as well as Chronos [11] which holds a lot of promise while still needing further testing. When it comes to large enterprise environments, LsPS [15] represents a promising approach as it delivered unprecedented performance and user control in a scalable and dynamic cluster, vastly improving upon default schedulers.…”

Section: Discussion; classification: mentioning

confidence: 99%

“…2) Chronos: Instead of creating a totally new "default" scheduler from scratch, a different approach for enhancing the native Hadoop scheduler is proposed by the authors of Chronos: Failure-Aware Scheduling in Shared Hadoop Clusters [11]. The authors argue that the performance of Hadoop systems in part depends on how failures are handled.…”

Section: LsPS (Leveraging Size Patterns Scheduler); classification: mentioning

confidence: 99%

Hadoop MapReduce scheduling paradigms

Johannessen, Yazidi, Feng (2017), 2017 IEEE 2nd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA)

Apache Hadoop is one of the most prominent and early technologies for handling big data. Different scheduling algorithms within the framework of Apache Hadoop were developed in the last decade. In this paper, we attempt to provide a comprehensive overview over the different paradigms for scheduling in Apache Hadoop. The surveyed approaches fall under different categories, namely, Deadline prioritization, Resource prioritization, Job size prioritization, Hybrid approaches and recent trends for improvements upon default schedulers.


“…In case of hardware or software failures, the affected hardware that might be prone to failure can be avoided in scheduling to avoid any further failures. Chronos (Yildiz et al 2015) is a Hadoop-based failure-aware scheduler that uses pre-emption on failed jobs. Then it recovers from failure by reallocating the failed jobs with pre-empted resources to meet the SLA objectives.…”

Section: Failure/anomaly Detection and Mitigation; classification: mentioning

confidence: 99%

Handbook of Research on Cloud Computing and Big Data Applications in IoT

2019, Advances in Computer and Electrical Engineering

This chapter presents software architectures of big data processing platforms. It also provides in-depth knowledge on the resource management techniques involved in deploying big data processing systems in the cloud environment. It starts from the very basics and gradually introduces the core components of resource management, which are divided into multiple layers. It covers state-of-the-art practices and research in SLA-based resource management, with a specific focus on job scheduling mechanisms.

