Abstract-This paper presents an in-depth analysis of the impact of system noise on large-scale parallel application performance in realistic settings. Our analytical model shows that not only collective operations but also point-to-point communications influence the application's sensitivity to noise. We present a simulation toolchain that injects noise delays from traces gathered on common large-scale architectures into a LogGPS simulation and allows new insights into the scaling of applications in noisy environments. We investigate collective operations with up to 1 million processes and three applications (Sweep3D, AMG, and POP) with up to 32,000 processes. We show that the scale at which noise becomes a bottleneck is system-specific and depends on the structure of the noise. Simulations with different network speeds show that a 10x faster network does not improve application scalability. We quantify noise and conclude that our tools can be utilized to tune the noise signatures of a specific system.
I. MOTIVATION AND BACKGROUND
The performance impact of operating system and architectural overheads (system noise) at massive scale is increasingly of concern. Even small local delays on compute nodes, which can be caused by interrupts, operating system daemons, or even cache or page misses, can affect global application performance significantly [1]. Such local delays often cause less than 1% overhead per process, but severe performance losses can occur if noise is propagated (amplified) through communication or global synchronization. Previous analyses generally assume that the performance impact of system noise grows with scale, and Tsafrir et al. [2] even suggest that the impact of very low-frequency noise scales linearly with the system size.
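The amplification effect described above can be illustrated with a minimal sketch. It assumes a bulk-synchronous loop in which every process computes for a fixed time and then enters a barrier, and in which each process is independently hit by a detour of fixed length with a small probability per iteration. The function names and all parameter values (`work`, `noise_prob`, `noise_len`) are hypothetical choices for illustration. This toy model is not the LogGPS-based toolchain presented in this paper and does not use measured noise traces.

```python
import random


def expected_slowdown(num_procs, work=1.0, noise_prob=0.001, noise_len=0.5):
    """Analytic slowdown of one barrier-synchronized iteration (toy model).

    Assumption: each process is delayed by `noise_len` with probability
    `noise_prob`, independently of all others. The barrier waits for the
    slowest process, so the iteration is delayed if at least one process is hit.
    """
    p_any = 1.0 - (1.0 - noise_prob) ** num_procs
    return 1.0 + (noise_len / work) * p_any


def simulate_slowdown(num_procs, num_iters, work=1.0, noise_prob=0.001,
                      noise_len=0.5, seed=0):
    """Monte Carlo estimate of the same slowdown over `num_iters` iterations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_iters):
        # All detours have the same length, so the barrier time only depends
        # on whether any process was hit in this iteration.
        hit = any(rng.random() < noise_prob for _ in range(num_procs))
        total += work + (noise_len if hit else 0.0)
    return total / (num_iters * work)


if __name__ == "__main__":
    # Per-process overhead is noise_prob * noise_len / work = 0.05% here.
    for procs in (1, 100, 10_000):
        print(procs,
              round(expected_slowdown(procs), 4),
              round(simulate_slowdown(procs, 500), 4))
```

Under these assumptions, the per-process overhead stays at 0.05%, yet the slowdown of the synchronized loop approaches the full detour length per iteration once the process count is large enough that some process is almost always delayed. Real noise has a richer structure than this fixed-length Bernoulli model, which is why the shape of the curve is system-specific.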
A. Related Work
Petrini, Kerbyson, and Pakin [1] report that the parallel performance of SAGE on a fixed number of ASCI Q nodes was highest when SAGE used only three of the four CPUs per node. It turned out that "resonance" between the application's collective communication and the misconfigured system caused delays during each iteration. Jones, Brenner, and Fier [3] observed similar effects with collective communication and also report that, under certain circumstances, it is beneficial to leave one CPU idle. A theoretical analysis of the influence of noise on collective communication [4] suggests that the impact of noise depends on the type of noise distribution and its parameters and can, in the worst case (exponential distribution), scale linearly with the number of processes. Ferreira, Bridges, and Brightwell use noise-injection techniques to assess the impact of noise on several applications [5]. Beckman et al. [6] analyzed the performance on BlueGene/L, concluding that most sources of noise can be avoided in very specialized systems.
Previous work was either limited to experimental analysis on specific architectures with injection of artificially generated noise (fixed frequency), or to purely theoretical analyses that assume a particular collective pattern [4]. These previou...