2017
DOI: 10.1016/j.future.2016.02.015

Enabling fast failure recovery in shared Hadoop clusters: Towards failure-aware scheduling

Abstract: Hadoop emerged as the de facto state-of-the-art system for MapReduce-based data analytics. The reliability of Hadoop systems depends in part on how well they handle failures. Currently, Hadoop handles machine failures by re-executing all the tasks of the failed machines (i.e., executing recovery tasks). Unfortunately, this elegant solution is entirely entrusted to the core of Hadoop and hidden from Hadoop schedulers. The unawareness of failures therefore may prevent Hadoop schedulers from operating correctly t…

Cited by 32 publications (7 citation statements)
References: 24 publications
“…Their study was used for heterogeneous environments but not applied to dynamic or unreliable environments. Yildiz et al [22] coped with failure recovery by prioritizing tasks and allowing the tasks to pre-empt other tasks. The key idea was to alleviate the impact of node failure and achieve a reduction in overall completion time.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…To reduce the impact of dynamics in distributed environments. Yildiz et al [16] proposed a model to recover failure of computing nodes. The key idea was to prioritize tasks and to allow the prioritized tasks to pre-empt other tasks in failure recovery.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…These approaches reduce the task fault occurrences and improve their overall performance with low latency in fault detection. Furthermore, works by Yildiz et al [21] and Kadirvel et al [22] aimed to decrease resource usage during failures by adopting a lightweight pre-emption technique and dynamic resource scaling to reduce the cost of the additional resources when removing failures.…”
Section: Related Work
Citation type: mentioning
confidence: 99%