As deep learning models grow larger, training takes longer and consumes more resources, making fault tolerance increasingly critical. Existing state-of-the-art methods such as CheckFreq and Elastic Horovod must keep a backup copy of the model state (i.e., parameters and optimizer states) in memory, which is costly for large models and leads to non-trivial overhead. This paper presents SWIFT, a novel recovery design for distributed deep neural network training that significantly reduces failure recovery overhead without affecting training throughput or model accuracy. Instead of making an additional copy of the model state, SWIFT resolves the inconsistencies in the model state caused by a failure and exploits the replicas of the model state that already exist in data parallelism for failure recovery. When replicas are unavailable, we propose a logging-based approach that records intermediate data and replays the computation to recover the lost state after a failure. The re-computation is distributed across multiple machines to further accelerate failure recovery. We also log intermediate data selectively, exploring the trade-off between recovery time and the storage overhead of intermediate data. Evaluations show that SWIFT significantly reduces failure recovery time and achieves similar or better training throughput during failure-free execution compared to state-of-the-art methods, without degrading final model accuracy. SWIFT can also achieve up to a 1.16x speedup in total training time compared to state-of-the-art methods.
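The logging-and-replay idea described above can be illustrated with a toy sketch. It is not SWIFT's implementation: the `Stage` class, the scalar weight, the `train` and `recover` helpers, and the assumption that the log lives somewhere that survives the worker's failure are all hypothetical, chosen only to show that replaying recorded intermediate data in the original order reconstructs the lost state.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """Toy worker holding one scalar weight, updated by plain SGD (hypothetical)."""
    weight: float = 0.0
    lr: float = 0.1

    def step(self, activation: float, upstream_grad: float) -> None:
        # Weight gradient of a scalar linear layer: activation * upstream_grad.
        self.weight -= self.lr * activation * upstream_grad


def train(stage, batches, log):
    """Run training steps, logging the intermediate data each step consumes."""
    for activation, upstream_grad in batches:
        # Assumption: the log is stored off the worker, so it survives a failure.
        log.append((activation, upstream_grad))
        stage.step(activation, upstream_grad)


def recover(last_known_weight, log, lr=0.1):
    """Rebuild a lost worker by replaying the logged steps in their original order."""
    stage = Stage(weight=last_known_weight, lr=lr)
    for activation, upstream_grad in log:
        stage.step(activation, upstream_grad)
    return stage


worker = Stage()
last_known_weight = worker.weight  # state before the logged steps began
log = []
train(worker, [(1.0, 0.5), (2.0, -0.25), (0.5, 1.0)], log)

lost_weight = worker.weight        # pretend the worker now fails
recovered = recover(last_known_weight, log)
```

Because the replay applies the same deterministic updates in the same order, `recovered.weight` matches the weight held by the failed worker. The sketch also makes the trade-off mentioned in the abstract visible: a longer log means more storage and a longer replay.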
This paper proposes a novel three-dimensional theoretical model of magnetic flux leakage (MFL) based on the magnetic dipole model. The magnetic dipole model assumes that a ferromagnetic specimen with defects is exposed to a uniform external magnetic field, which causes uniform magnetization around the defect surface. Under this assumption, the MFL can be regarded as arising from magnetic charges on the defect surface. Previous theoretical models were mostly used to analyze simple crack defects such as cylindrical and rectangular cracks. In this paper, we develop a magnetic dipole model for more complex defect shapes, such as circular truncated holes, conical holes, elliptical holes, and double-curve-shaped crack holes, to complement the existing defect shapes. Experimental results and comparisons with previous models demonstrate that the proposed model provides a better approximation of complex defect shapes.
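As a rough illustration of the magnetic-charge picture described above, the sketch below treats the simple rectangular crack case (not the complex shapes the paper develops). It assumes the two crack walls facing the applied field carry uniform magnetic charge densities of opposite sign, and sums Coulomb-type contributions from small wall patches with a midpoint rule. The function name, geometry parameters, default values, and the unit convention for `sigma` are all assumptions made for this example.

```python
import math


def leakage_field(px, py, pz, half_width=0.5, half_length=2.0,
                  depth=1.0, sigma=1.0, n=40):
    """Return (Hx, Hy, Hz) at (px, py, pz) above a rectangular slot.

    Assumed geometry: walls at x = -half_width (charge +sigma) and
    x = +half_width (charge -sigma), spanning |y| <= half_length and
    -depth <= z <= 0. The specimen surface is the plane z = 0.
    """
    hx = hy = hz = 0.0
    dy = 2.0 * half_length / n
    dz = depth / n
    for wall_x, charge in ((-half_width, sigma), (half_width, -sigma)):
        for i in range(n):
            y = -half_length + (i + 0.5) * dy      # patch centre along the crack
            for j in range(n):
                z = -depth + (j + 0.5) * dz        # patch centre along the depth
                rx, ry, rz = px - wall_x, py - y, pz - z
                r3 = (rx * rx + ry * ry + rz * rz) ** 1.5
                # Field of a point magnetic charge: q * r / (4 * pi * |r|^3).
                w = charge * dy * dz / (4.0 * math.pi * r3)
                hx += w * rx
                hy += w * ry
                hz += w * rz
    return hx, hy, hz


# Sample the field at a lift-off of 0.2 on both sides of the crack centre.
hx_right, _, hz_right = leakage_field(0.3, 0.0, 0.2)
hx_left, _, hz_left = leakage_field(-0.3, 0.0, 0.2)
hx_centre, hy_centre, hz_centre = leakage_field(0.0, 0.0, 0.2)
```

Under these assumptions the tangential component is symmetric about the crack centre and the normal component is antisymmetric, which is the familiar qualitative shape of an MFL signal over a crack. A more complex defect would change only the surface over which the patches are placed and the charge density assigned to each patch.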