Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are generally not considered core capabilities, either in system requirements specifications or in vendor development strategies. In this paper we present work performed at a number of large-scale HPC sites toward developing monitoring capabilities that fill current gaps in ease of problem identification and root cause discovery. We also present our collective views, based on the experiences presented, on needs and requirements for enabling development by vendors or users of effective, sharable, end-to-end monitoring capabilities.
The prompt reconstruction of the data recorded from the Large Hadron Collider (LHC) detectors has always been addressed by dedicated resources at the CERN Tier-0. Such workloads come in spikes due to the nature of the operation of the accelerator, and on special high-load occasions the experiments have commissioned methods to distribute (spill over) a fraction of the load to sites outside CERN. The present work demonstrates a new way of supporting the Tier-0 environment by provisioning resources elastically for such spilled-over workflows on the Piz Daint Supercomputer at CSCS. This is implemented using containers, tuning the existing batch scheduler and reinforcing the scratch file system, while still using standard Grid middleware. ATLAS, CMS and CSCS have jointly run selected prompt data reconstruction on up to several thousand cores on Piz Daint in a shared environment, thereby probing the viability of the CSCS high performance computing site as an on-demand extension of the CERN Tier-0, which could play a role in addressing the future LHC computing challenges of the High-Luminosity LHC.
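To make the provisioning model concrete, the sketch below shows one way a containerized reconstruction payload could be handed to a Slurm-managed system such as Piz Daint. It is a minimal illustration only: the container image, job name, constraint, payload script and the Shifter-style invocation are hypothetical placeholders and do not reproduce the actual ATLAS/CMS Tier-0 workflow or the Grid middleware layers described in the abstract.

```python
"""Illustrative sketch: elastically submitting a containerized reconstruction
task to a Slurm batch system. Image name, constraint and payload path are
hypothetical placeholders, not the experiments' actual configuration."""
import subprocess
import textwrap


def build_job_script(image: str, n_nodes: int, walltime: str) -> str:
    # Slurm batch script that runs the payload inside a container runtime.
    # CSCS provides Shifter-style container support; the exact invocation
    # here is only indicative.
    return textwrap.dedent(f"""\
        #!/bin/bash
        #SBATCH --job-name=t0-spillover
        #SBATCH --nodes={n_nodes}
        #SBATCH --time={walltime}
        srun shifter --image={image} /payload/run_reco.sh
        """)


def submit(script: str) -> str:
    # sbatch accepts the batch script on stdin; return its confirmation line.
    result = subprocess.run(["sbatch"], input=script, text=True,
                            capture_output=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    job = build_job_script("experiment/reco:latest", n_nodes=4, walltime="02:00:00")
    print(job)            # inspect the generated batch script
    # print(submit(job))  # uncomment only on a system where Slurm is available
```

In an elastic spill-over scenario, a driver like this would be invoked only when the Tier-0 backlog grows, so the HPC resources are claimed on demand rather than held permanently.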
The scale of Leadership Class Systems presents unique challenges to the features and performance of operating system services. This paper reports results of comprehensive evaluations of two Light Weight Operating Systems (LWOS), Cray's Catamount Virtual Node (CVN) and Cray Linux Environment (CLE), on the exact same large-scale hardware. The evaluation was carried out over a 5-month period on NERSC's 19,480-core Cray XT-4, Franklin, using a comprehensive evaluation method that spans Performance, Effectiveness, Reliability, Consistency and Usability criteria for all major subsystems and features. The paper presents the results of the comparison between CVN and CLE, evaluates their relative strengths, and reports observations regarding the world's largest Cray XT-4 as well.
A short description of work completed at NERSC over the past six months to identify and remedy asymmetries in the batch compute resources provided by NERSC's IBM SP, seaborg.nersc.gov. Background: NERSC's IBM SP consists of 375 MHz Nighthawk II POWER3 nodes.
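As a loose illustration of the kind of asymmetry hunting this note describes, the sketch below compares per-node timings from a uniform benchmark run and flags nodes that deviate from the median. The input format (node name and seconds per line) and the 10% tolerance are assumptions for illustration, not NERSC's actual procedure.

```python
"""Illustrative sketch: flagging batch nodes whose benchmark timings deviate
from their peers. Input format and tolerance are assumptions only."""
import statistics
import sys


def find_outliers(timings: dict[str, float], tolerance: float = 0.10) -> dict[str, float]:
    """Return nodes whose runtime differs from the median by more than `tolerance`."""
    median = statistics.median(timings.values())
    return {node: t for node, t in timings.items()
            if abs(t - median) / median > tolerance}


if __name__ == "__main__":
    # Expected input lines on stdin: "<node_name> <benchmark_seconds>"
    timings = {}
    for line in sys.stdin:
        node, seconds = line.split()
        timings[node] = float(seconds)
    for node, t in sorted(find_outliers(timings).items()):
        print(f"{node}: {t:.1f}s (suspect)")
```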