Checkpoint Placement for Systematic Fault-Injection Campaigns

Dietrich, Christian; Thomas, Tim-Marek; Mnich, Matthias

doi:10.1109/iccad57390.2023.10323809

Cited by 1 publication

(1 citation statement)

References 31 publications

“…Checkpointing is a technique that has been used virtually from the beginning of computing to improve fault tolerance and to make sure that the system can correctly recover from errors, failures, or interruptions Schulz (2011). Checkpoint and recovery approaches advanced in the 1960s and 1970s as computers improved in complexity and dependability and are still a key component of safety-critical systems Dietrich et al (2023); Goulart et al (2023). Researchers created checkpoint and recovery techniques based on software that allowed the system to recover from both hardware and software errors and disruptions Schulz (2011).…”

Section: Checkpointing (mentioning)

confidence: 99%

A Recovery-point Mechanism for Low-power Embedded ML Applications

Cheung, Beckett, Kumar

2023

Preprint

Increasingly, machine learning applications are being run on systems comprising low-power CPUs driven by unreliable or intermittent power sources. In the event of a system failure, these applications typically have to be re-run from the beginning, which can waste both time and energy and can potentially compromise the training process of an ML algorithm. This paper proposes a model that allows an embedded operating system to auto-discover a suitable recovery and restart point so that a failed application can be restarted with minimal effect on its performance. The proposal encompasses a complete software stack comprising a modified cross-compilation toolchain, a modified XV6 OS kernel, and a custom executable loader. The model exhibits time savings of over 50% in cases where the application has passed the midpoint of its run, but is less effective if the failure occurs earlier in the application’s run time.
