“…Improving the reliability of high-performance computing (HPC) systems is one of the leading research areas in the HPC field. Many studies have explored methods for understanding how, why, and when failures occur in large-scale HPC systems (Atif & Strazdins, 2009; DeBardeleben, Blanchard, Fu, Guan, & Zhang, 2011; Fu & Xu, 2007; Hacker et al., 2009; Oliner & Stearley, 2007; Pandit, Kalbarczyk, & Iyer, 2009; Romero, 2010; Salfner & Tschirpke, 2008; Zhang, Squillante, Sivasubramaniam, & Sahoo, 2004; Zheng, Lan, Park, & Geist, 2009; Zhou, Zhan, Meng, Xu, & Zhang, 2010).…”