Proceedings of the 27th ACM Symposium on Operating Systems Principles 2019
DOI: 10.1145/3341301.3359653

Lineage stash

Abstract: As cluster computing frameworks such as Spark, Dryad, Flink, and Ray are being deployed in mission critical applications and on larger and larger clusters, their ability to tolerate failures is growing in importance. These frameworks employ two broad approaches for fault tolerance: checkpointing and lineage. Checkpointing exhibits low overhead during normal operation but high overhead during recovery, while lineage-based solutions make the opposite tradeoff. We propose the lineage stash, a decentralized causal…
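
The tradeoff in the abstract is worth unpacking: lineage-based frameworks pay a logging cost during normal operation (every task's dependencies must be recorded), but after a failure they recompute only the lost outputs, whereas checkpoint-based frameworks pay little while running and instead roll the job back to the last snapshot. The Python sketch below is a hypothetical, minimal illustration of the lineage side only; the LineageStore class, the object ids, and the run_task/get helpers are invented for this example and are not APIs from the paper, Ray, or Spark.

```python
# Hypothetical sketch, not the paper's lineage stash protocol: a toy
# lineage-based recovery scheme. During normal operation we only record,
# for each object, the function and inputs that produced it (the lineage).
# After a loss, only the missing objects are recomputed by replaying that
# lineage, instead of rolling the whole computation back to a checkpoint.

class LineageStore:
    def __init__(self):
        self.lineage = {}   # object id -> (func, tuple of input object ids)
        self.objects = {}   # object id -> value; entries may be lost on failure

    def run_task(self, obj_id, func, *input_ids):
        # Record lineage before executing so the task can be replayed later.
        self.lineage[obj_id] = (func, input_ids)
        args = [self.get(dep) for dep in input_ids]
        self.objects[obj_id] = func(*args)
        return obj_id

    def get(self, obj_id):
        # If the value is gone (e.g. its worker failed), recompute it
        # transitively from its recorded lineage.
        if obj_id not in self.objects:
            func, input_ids = self.lineage[obj_id]
            self.objects[obj_id] = func(*(self.get(dep) for dep in input_ids))
        return self.objects[obj_id]


store = LineageStore()
store.run_task("a", lambda: list(range(4)))
store.run_task("b", lambda xs: [x * x for x in xs], "a")

del store.objects["b"]        # simulate losing an intermediate result
print(store.get("b"))         # recomputed from lineage: [0, 1, 4, 9]
```

Note that this sketch keeps the lineage table in the same process as the values; surviving a real worker failure requires the lineage itself to be made durable or replicated, which is where the normal-operation cost of lineage-based recovery comes from.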

Cited by 30 publications (6 citation statements)
References 27 publications (25 reference statements)

“…3 Ray has been increasingly adopted by many enterprises, such as Ant Group, Intel, Microsoft, and AWS, to build various AI and big data systems. [98][99][100]…”
Section: Ray
confidence: 99%
“…Two main types of data are logged. Spark [13] and Ray [14], [31] record lineage, i.e., the computation graph. Other systems record (or just buffer) raw, intermediate data [15], [16].…”
Section: Logging-based Failure Recovery
confidence: 99%
“…We then investigate another fundamental approach for fault tolerance in distributed systems - logging, which has been widely explored in data processing systems [13], [14], [15], [16]. We introduce logging-based recovery (§5) for pipeline-parallel training.…”
Section: Introduction
confidence: 99%
“…For SE researchers, we suggest that they build runtime monitoring frameworks to collect traces for reproduction or adopt dynamic-analysis-based repair techniques. Existing fault reproduction methods such as checkpoint-and-replay may not be directly applied to distributed training because of the high runtime overhead or recovery overhead [99]. Researchers can design new multi-device checkpoint-and-replay techniques to help developers reproduce their faults efficiently.…”
Section: I4
confidence: 99%
“…Researchers can design new multi-device checkpoint-and-replay techniques to help developers reproduce their faults efficiently. F.7 Distributed training is usually multi-processing and can easily cause nondeterministic behaviors [99]. Sometimes developers cannot reproduce faults by running the same code again because of these characteristics of distributed training [42].…”
Section: I4
confidence: 99%