This paper presents a new checkpoint/recovery method for dataflow computations using work-stealing in heterogeneous environments such as those found in grid or cluster computing. By basing the state of the computation on a dynamic macro dataflow graph, the mechanisms are shown to provide effective checkpointing for multithreaded applications in heterogeneous environments. Two methods, Systematic Event Logging and Theft-Induced Checkpointing, are presented; both are efficient and extremely flexible under the system-state model, allowing recovery on different platforms with a different number of processors. A formal analysis of the overhead induced by both methods is presented, followed by an experimental evaluation on a large cluster. Both methods are shown to have very small overhead, and the trade-off between checkpointing cost and recovery cost can be controlled.
The latest advances in survival analysis have centered on multivariate systems. Multivariate survival analysis has two major categories of models: multistate modeling and shared frailty modeling. Multistate models, although formulated differently in the two fields, have been studied extensively in reliability analysis in the context of Markov chain analysis. In contrast, shared frailty modeling seems little known in reliability analysis and computer science. In this article, we focus exclusively on shared frailty modeling. Shared frailty refers to the often unobserved factors or risks responsible for the common risks dependence between multiple events. It is well recognized as the most effective modeling approach for addressing common risks dependence and, more recently, event-related dependence. The only type of dependence that the frailty approach does not cover is the common events type, which is best addressed by multistate modeling. We argue that shared frailty modeling is not only perfectly applicable to engineering reliability but also has significant potential in other fields of computer science, such as networking and software reliability and survivability, machine learning, and prognostics and health management (PHM).