2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA) 2016
DOI: 10.1109/aina.2016.84
Straggler Detection in Parallel Computing Systems through Dynamic Threshold Calculation

Abstract: Cloud computing systems face the substantial challenge of the Long Tail problem: a small subset of straggling tasks can significantly impede parallel job completion. This behavior results in longer service response times and degraded system utilization. Speculative execution, which creates task replicas at runtime, is a typical method deployed in large-scale distributed systems to tolerate stragglers. This approach defines stragglers by specifying a static threshold value, which calculates the temporal difference…


Cited by 23 publications (27 citation statements)
References 10 publications
“…50% greater than the median task execution, DoS-Index ≥ 2.5), as tasks that complete just below the threshold will be detected as false positives, yet still impede job completion. These results indicate the need for more intelligent metrics for straggler detection, transitioning away from a fixed temporal boundary as defined in [4][8] towards an adaptive boundary that considers metrics such as task progression, system conditions and job QoS, as detailed in [34]. We observe that the online analytics agent produces approximately 0.2% CPU usage for both job types, a fractional amount of server usage, with no indication of an increase in straggler behavior caused by high CPU.…”
Section: Experiments Results (mentioning)
confidence: 99%
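The excerpt above pairs a relative-runtime cutoff with a DoS (degree of straggler) index of 2.5. Assuming the index is simply a task's runtime divided by the median runtime of tasks in the same job (an assumption — the excerpt does not give the exact formula), the check could be sketched as:

```python
from statistics import median

def dos_index(task_runtime, all_runtimes):
    """Illustrative DoS index: a task's runtime relative to the
    median runtime of tasks in the same job."""
    return task_runtime / median(all_runtimes)

# Hypothetical task durations (seconds) for one parallel job.
runtimes = [10.0, 11.0, 10.5, 30.0]

# Flag tasks whose DoS index meets or exceeds the 2.5 cutoff.
stragglers = [t for t in runtimes if dos_index(t, runtimes) >= 2.5]
```

With these sample durations, only the 30-second task crosses the cutoff; tasks sitting just under it escape detection even though they still delay the job, which is exactly the false-negative concern the excerpt raises.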
“…But tasks in a parallel job are usually independent of each other; although long tails do not directly affect the progress rate of other co-located tasks, long-running tasks still delay job completion by a considerable margin. Most of the existing literature adopts a threshold value of 50% [3] to classify long tails, whereby tasks exhibiting a runtime 50% greater than the average job duration are classified as stragglers, as shown in equation (8). In general, the identification efficiency of long tails depends on the time of classification.…”
Section: Long Tail Stragglers (mentioning)
confidence: 99%
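The 50%-over-average rule quoted above reduces to a one-line classifier. A minimal sketch, with illustrative function name and sample durations:

```python
def classify_stragglers(durations, threshold=1.5):
    """Flag tasks whose runtime exceeds `threshold` times the mean
    task runtime of the job (threshold=1.5 is the common 50% rule)."""
    mean = sum(durations) / len(durations)
    return [d for d in durations if d > threshold * mean]

# Hypothetical task durations (seconds): mean is 21.0, so the
# cutoff is 1.5 * 21.0 = 31.5 and only the 45-second task is flagged.
flagged = classify_stragglers([12.0, 14.0, 13.0, 45.0])
```

Note the classifier's dependence on when it runs: early in a job, few tasks have finished, so the mean (and hence the cutoff) is unstable, which is the "time of classification" caveat in the excerpt.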
“…Users submit their requests in the form of jobs [1] to the datacentres; a single job may encompass one to several tasks. The scheduler [2] in the datacentre schedules and allocates the tasks belonging to a single job across the available server nodes based on the computational requirements of each individual task, unlike a typical MapReduce platform [3], which divides a job into multiple tasks and executes them in evenly distributed subsets. In other words, tasks belonging to a single job show an increased level of heterogeneity [4,5] in terms of their resource requirements, resource consumption, task duration, etc., whereas a MapReduce platform tries to achieve even execution profiles across the distributed tasks, such as similar duration and resource consumption.…”
Section: Introduction (mentioning)
confidence: 99%
“…In most literature pertaining to stragglers, this threshold is typically configured to a value of 1.5 (i.e. tasks whose execution is 50% greater than the average execution of tasks within the same job) [10][14][18], while [19] proposes a dynamic threshold calculation algorithm to define task stragglers in accordance with workload type and cluster resource usage.…”
Section: Framework For Modeling and Ranking Node Execution Performance (mentioning)
confidence: 99%
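The excerpt contrasts the static 1.5 cutoff with a dynamic one computed from workload and cluster state. As a purely illustrative sketch (this is not the actual algorithm from the cited paper, and the scaling factor is an assumption), a dynamic cutoff might relax as cluster utilization rises, since contention inflates all runtimes and a tight static cutoff would then over-flag:

```python
def dynamic_threshold(base=1.5, cluster_utilization=0.0, sensitivity=0.5):
    """Illustrative dynamic straggler threshold: start from the common
    1.5x cutoff and widen it linearly with cluster resource usage.
    `cluster_utilization` is a fraction in [0, 1]; `sensitivity`
    controls how strongly load relaxes the cutoff. Hypothetical
    scaling, not the algorithm proposed in [19]."""
    return base * (1.0 + sensitivity * cluster_utilization)

# At 80% utilization the cutoff widens from 1.5x to 2.1x the mean.
cutoff = dynamic_threshold(cluster_utilization=0.8)
```

An idle cluster keeps the familiar 1.5 cutoff, while a loaded one tolerates proportionally longer tasks before launching speculative copies.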