The resource demands of HPC applications vary significantly. However, HPC systems commonly assign resources on a per-node basis to prevent interference from co-located workloads. This gap between coarse-grained resource allocation and varying resource demands can lead to underutilization of HPC resources. In this study, we comprehensively analyzed the resource usage and characteristics of NERSC's Perlmutter, a state-of-the-art HPC system with both CPU-only and GPU-accelerated nodes. Our three-week usage analysis revealed that the majority of jobs had low CPU utilization and that around 86% of both CPU-only and GPU-enabled jobs used 50% or less of the available host memory. Additionally, 52.1% of GPU-enabled jobs used at most 25% of the GPU memory, and memory capacity was over-provisioned to some degree for all jobs. The study also found that 60% of GPU-enabled jobs had idle GPUs, which suggests that resources can be underutilized while users adapt their workflows to a system with new resources. Our research provides insights into performance characterization and offers new perspectives for system operators to understand and track the migration of workloads. It can also inform the design, optimization, and procurement of HPC systems.
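To make the reported shares concrete, the sketch below shows one way a statistic such as "the fraction of jobs that used 50% or less of the available host memory" could be computed from per-job records. It is a minimal illustration, not the study's actual pipeline: the `JobSample` record, its field names, the `fraction_at_or_below` helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobSample:
    """Hypothetical per-job record; the field names are assumptions."""
    job_id: str
    peak_mem_gib: float  # peak host memory the job used on a node
    node_mem_gib: float  # host memory capacity of that node


def fraction_at_or_below(jobs, threshold):
    """Return the share of jobs whose peak memory utilization is <= threshold.

    `threshold` is a fraction of node capacity, e.g. 0.5 for 50%.
    """
    if not jobs:
        raise ValueError("no jobs to summarize")
    hits = sum(1 for j in jobs if j.peak_mem_gib / j.node_mem_gib <= threshold)
    return hits / len(jobs)


# Made-up illustrative records, not measurements from Perlmutter.
jobs = [
    JobSample("a", 64.0, 512.0),
    JobSample("b", 300.0, 512.0),
    JobSample("c", 128.0, 256.0),
    JobSample("d", 10.0, 256.0),
]

share = fraction_at_or_below(jobs, 0.5)
print(f"{share:.0%} of jobs used 50% or less of host memory")
```

The same helper applies to GPU memory or to a different cutoff (for example 0.25) by changing which capacity and threshold are passed in.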