Proceedings of the 27th ACM Symposium on Operating Systems Principles 2019
DOI: 10.1145/3341301.3359653

Lineage stash

Abstract: As cluster computing frameworks such as Spark, Dryad, Flink, and Ray are being deployed in mission critical applications and on larger and larger clusters, their ability to tolerate failures is growing in importance. These frameworks employ two broad approaches for fault tolerance: checkpointing and lineage. Checkpointing exhibits low overhead during normal operation but high overhead during recovery, while lineage-based solutions make the opposite tradeoff. We propose the lineage stash, a decentralized causal…
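
The tradeoff in the abstract is worth unpacking: lineage-based frameworks pay a logging cost during normal operation (every task's dependencies must be recorded), but after a failure they recompute only the lost outputs, whereas checkpoint-based frameworks pay little while running and instead roll the job back to the last snapshot. The Python sketch below is a hypothetical, minimal illustration of the lineage side only; the LineageStore class, the object ids, and the run_task/get helpers are invented for this example and are not APIs from the paper, Ray, or Spark.

```python
# Hypothetical sketch, not the paper's lineage stash protocol: a toy
# lineage-based recovery scheme. During normal operation we only record,
# for each object, the function and inputs that produced it (the lineage).
# After a loss, only the missing objects are recomputed by replaying that
# lineage, instead of rolling the whole computation back to a checkpoint.

class LineageStore:
    def __init__(self):
        self.lineage = {}   # object id -> (func, tuple of input object ids)
        self.objects = {}   # object id -> value; entries may be lost on failure

    def run_task(self, obj_id, func, *input_ids):
        # Record lineage before executing so the task can be replayed later.
        self.lineage[obj_id] = (func, input_ids)
        args = [self.get(dep) for dep in input_ids]
        self.objects[obj_id] = func(*args)
        return obj_id

    def get(self, obj_id):
        # If the value is gone (e.g. its worker failed), recompute it
        # transitively from its recorded lineage.
        if obj_id not in self.objects:
            func, input_ids = self.lineage[obj_id]
            self.objects[obj_id] = func(*(self.get(dep) for dep in input_ids))
        return self.objects[obj_id]


store = LineageStore()
store.run_task("a", lambda: list(range(4)))
store.run_task("b", lambda xs: [x * x for x in xs], "a")

del store.objects["b"]        # simulate losing an intermediate result
print(store.get("b"))         # recomputed from lineage: [0, 1, 4, 9]
```

Note that this sketch keeps the lineage table in the same process as the values; surviving a real worker failure requires the lineage itself to be made durable or replicated, which is where the normal-operation cost of lineage-based recovery comes from.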

Cited by 30 publications (6 citation statements)
References 27 publications (25 reference statements)

“…3 Ray has been increasingly adopted by many enterprises, such as Ant Group, Intel, Microsoft, and AWS, to build various AI and big data systems. [98][99][100]…”
Section: Ray
confidence: 99%
“…Two main types of data are logged. Spark [13] and Ray [14], [31] record lineage, i.e., the computation graph. Other systems record (or just buffer) raw, intermediate data [15], [16].…”
Section: Logging-based Failure Recovery
confidence: 99%
“…We then investigate another fundamental approach for fault tolerance in distributed systems - logging, which has been widely explored in data processing systems [13], [14], [15], [16]. We introduce logging-based recovery (§5) for pipeline-parallel training.…”
Section: Introduction
confidence: 99%
“…For SE researchers, we suggest that they build runtime monitoring frameworks to collect traces for reproduction or adopt dynamic-analysis-based repair techniques. Existing fault reproduction methods such as checkpoint-and-replay may not be directly applied to distributed training because of the high runtime overhead or recovery overhead [99]. Researchers can design new multi-device checkpoint-and-replay techniques to help developers reproduce their faults efficiently.…”
Section: I4
confidence: 99%
“…Researchers can design new multi-device checkpoint-and-replay techniques to help developers reproduce their faults efficiently. F.7 Distributed training is usually multi-processing and can easily cause nondeterministic behaviors [99]. Sometimes developers cannot reproduce faults by running the same code again because of these characteristics of distributed training [42].…”
Section: I4
confidence: 99%