On the Convergence of Malleability and the HPC PowerStack: Exploiting Dynamism in Over-Provisioned and Power-Constrained HPC Systems

Arima, Eishi; Comprés, A. Isaías; Schulz, Martin

doi:10.1007/978-3-031-23220-6_14

Cited by 7 publications

(1 citation statement)

References 30 publications

“…Many centers focus on developing such monitoring infrastructures, often with the goal of system tracking and tuning. A prominent example, currently deployed on production systems at LRZ and extended in the projects DEEP-SEA [37] and REGALE [38], is the Data Center Data Base (DCDB) [39], [40], which is capable of routinely tracking millions of sensors on large scale production systems, such as SuperMUC-NG, using technologies from the IoT space combined with a federation of time series databases built on top of Cassandra. Similarly, the ADMIRE project [41] is building an entirely new measurement infrastructure relying on the Prometheus time-series database (TSDB) connected to a node-level aggregating push gateway coupled with LIMITLESS [42] for node-level monitoring and high-speed spatial reduction based on a tree-based overlay network (TBON).…”

Section: B Monitoring and Modeling (mentioning)

confidence: 99%

Malleability in Modern HPC Systems: Current Experiences, Challenges, and Future Opportunities

Tarraf, Schreiber, Cascajo et al. 2024

IEEE Trans. Parallel Distrib. Syst.

With the increase of complex scientific simulations driven by workflows and heterogeneous workload profiles, managing system resources effectively is essential for improving performance and system throughput, especially due to trends like heterogeneous HPC and deeply integrated systems with on-chip accelerators. For optimal resource utilization, dynamic resource allocation can improve productivity across all system and application levels, by adapting the applications' configurations to the system's resources. In this context, malleable jobs, which can change resources at runtime, can increase the system throughput and resource utilization while bringing various advantages for HPC users (e.g., shorter waiting time). Malleability has received much attention recently, even though it has been an active research area for almost two decades [1]. This paper presents the state-of-the-art of malleable implementations in HPC systems, targeting mainly malleability in compute and I/O resources. Based on our experiences, we state our current concerns and list future opportunities for research.
