2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012)
DOI: 10.1109/ccgrid.2012.24

Self-Healing of Operational Workflow Incidents on Distributed Computing Infrastructures

Abstract: Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires substantial human intervention to handle operational incidents. This report presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online and they make few assumptions on the application …

Citations: cited by 10 publications (7 citation statements)
References: 28 publications (32 reference statements)
“…The work presented here is a step in our attempt to control computing platforms where very little is known about applications and resources, and where situations change over time. Our works in [12,20] consider similar platform conditions but they target completely different problems, namely fault-tolerance and granularity control. We believe that results of this paper are the first ones presented to control fairness in such conditions which are often met in production platforms.…”
Section: Results (mentioning)
confidence: 99%
“…Grid conditions vary among repetitions because computing, storage and network resources are shared with other users. We use MOTEUR 0.9.21, configured to resubmit failed tasks up to 5 times, and with the task replication mechanism described in [12] activated. We use the DIRAC v6r5p1 instance provided by France-Grilles, with a first-come, first-served policy imposed by submitting workflows with decreasing priority values.…”
Section: Experiments Conditions (mentioning)
confidence: 99%
“…Since the outputs of each task in a workflow become inputs to subsequent tasks, and we use input size to estimate all the target parameters, poor output data size estimates for tasks at higher levels of the workflow may lead to a chain of increasing estimation errors for tasks at subsequent levels. Therefore, in addition to the offline estimation process, we also propose an online estimation process based on the MAPE-K loop (Monitoring, Analysis, Planning, Execution, and Knowledge), where task executions are constantly monitored [41,42]. Upon task completion, estimated values for the task are updated with the real values, and, based on these values, a new prediction is generated (using the regression tree of Fig.…”
Section: Online Task Resource Consumption Prediction For Scientific W (mentioning)
confidence: 99%
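
The online estimation step described in the citation statement above can be illustrated with a minimal sketch. It is not the cited authors' implementation: the quoted passage refers to a regression tree, which is replaced here by a hypothetical fixed-ratio predictor (`predict_output`), and the `Task` fields, function names, and size units are all assumptions made for illustration. The sketch only shows the update pattern: when a task completes, its measured output replaces the estimate, and estimates for the tasks that have not yet completed are regenerated.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Task:
    name: str
    parents: List[str] = field(default_factory=list)  # upstream task names
    estimated_output: float = 0.0          # predicted output size
    actual_output: Optional[float] = None  # measured size, set on completion


def predict_output(input_size: float, ratio: float = 0.5) -> float:
    # Hypothetical stand-in for the regression tree mentioned in the quote.
    return ratio * input_size


def input_size(task: Task, tasks: Dict[str, Task], external_input: float) -> float:
    # Entry tasks read the external input; other tasks read their parents'
    # outputs, using measured sizes when available and estimates otherwise.
    if not task.parents:
        return external_input
    total = 0.0
    for parent_name in task.parents:
        parent = tasks[parent_name]
        if parent.actual_output is not None:
            total += parent.actual_output
        else:
            total += parent.estimated_output
    return total


def replan(tasks: Dict[str, Task], order: List[str], external_input: float) -> None:
    # Analysis and planning: refresh estimates of unfinished tasks.
    # `order` must be a topological order of the workflow.
    for name in order:
        task = tasks[name]
        if task.actual_output is None:
            task.estimated_output = predict_output(
                input_size(task, tasks, external_input)
            )


def on_task_completed(tasks: Dict[str, Task], order: List[str], name: str,
                      measured: float, external_input: float) -> None:
    # Monitoring and knowledge update: record the real value, then re-predict.
    tasks[name].actual_output = measured
    replan(tasks, order, external_input)


def demo() -> Dict[str, Task]:
    # Three-task chain a -> b -> c with an external input of size 100.
    tasks = {
        "a": Task("a"),
        "b": Task("b", parents=["a"]),
        "c": Task("c", parents=["b"]),
    }
    order = ["a", "b", "c"]
    replan(tasks, order, external_input=100.0)   # offline estimates
    on_task_completed(tasks, order, "a", measured=80.0, external_input=100.0)
    return tasks
```

In this sketch, correcting the estimate for the first task also corrects the estimates downstream, which is the error-propagation concern the quoted passage raises.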
“…Instead of directly user input in the system, User defines general procedures and policies that guide the self-management process. IBM defines four main self-* components [7] [41] [42] [43] [44] [45].…”
Section: Introduction (mentioning)
confidence: 99%