2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018
DOI: 10.1109/dsn.2018.00022

Machine Learning Models for GPU Error Prediction in a Large Scale HPC System

Cited by 61 publications (25 citation statements)
References 28 publications
“…Although some of these studies utilize workload inherent features, such as the memory reuse time, they are orthogonal to our work. Predictive maintenance and statistical prediction of errors: Considerable research has been done on statistical prediction of different types of hardware faults, including DRAM errors, in supercomputers [4], [18], [35], [38], [44], [58], [66], [89]. The majority of these studies proposed different techniques, based either on rules [38] or Machine Learning [66], for prediction of failures that may happen in various hardware components using history of errors.…”
Section: Related Work (mentioning)
confidence: 99%
“…A previous study that analyzes GPU errors on the same system [48] reports that 98% of the detected errors come from the L2 cache. The findings of Nie et al. [47], therefore, could not be directly applied to device DRAM error prediction.…”
Section: GPU Memory Errors (mentioning)
confidence: 99%
“…A few recent studies have analysed GPU errors in the field [46], [47]. Nie et al. [47] analyze the GPU errors on the Titan supercomputer, which comprises 18,688 K20X GPUs.…”
Section: GPU Memory Errors (mentioning)
confidence: 99%
“…Nie et al. [18] continue this work with the analysis of the GPU-error-related data on the same system from February 2015 to June 2015. The follow-up study from the same team [19] proposes and evaluates several machine learning-based models for GPU error prediction. The studies reveal interesting insights about the temporal and spatial distribution of GPU errors, their correlation with temperature, GPU power consumption and workload characteristics.…”
Section: Related Work (mentioning)
confidence: 99%