The CMS workload management system

Cinquilli, M.; Evans, D.; Foulkes, S.; Hufnagel, D.; Mascheroni, Marco; Norman, M. D.; Maxa, Zdenek; Melo, A.; Metson, S.; Riahi, Hassen; Ryu, S.; Vaandering, Eric Wayne; Wakefield, S.; Wilkinson, R.

doi:10.1088/1742-6596/396/3/032113

Cited by 6 publications

(4 citation statements)

References 5 publications

(5 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The Tier-0 agent is a workflow management system built as a modification to the central CMS WMAgent system [4]. It consists of several daemons, called components, each one of them in charge of a specific task.…”

Section: Workflow Management System (Agent)mentioning

confidence: 99%

CMS Tier-0 data processing during the detector commissioning in Run 3

Amado,

Eysermans,

Giraldo

et al. 2024

EPJ Web of Conf.

View full text Add to dashboard Cite

The CMS Tier-0 system is responsible for the prompt processing and distribution of the data collected by the CMS Experiment. A number of upgrades were implemented during the long shutdown 2 of the Large Hadron Collider, which improved the performance and reliability of the system. In this report, these upgrades are discussed and a more detailed description of the Tier-0 system is given. The experience of the data taking during Run 3 detector commissioning as well as the performance of the system are highlighted.

show abstract

Section: Workflow Management System (Agent)mentioning

confidence: 99%

CMS Tier-0 data processing during the detector commissioning in Run 3

Amado,

Eysermans,

Giraldo

et al. 2024

EPJ Web of Conf.

View full text Add to dashboard Cite

show abstract

“…The split-starter code already supports configurable hooks that can be used to implement the described data staging strategy. This model, with additional data manipulation steps, would be transparent to the CMS workload management system (WMAgent [21]), which would only be aware of the PIC RSE. In the case of merge tasks, which produce final datasets out of the multiple output files generated from different jobs (unmerged files), they would be exclusively run at PIC, as unmerged data files would always appear as registered in PIC's storage.…”

Section: Handling Of Input and Output Data Filesmentioning

confidence: 99%

Exploitation of network-segregated CPU resources in CMS

et al. 2021

View full text Add to dashboard Cite

CMS is tackling the exploitation of CPU resources at HPC centers where compute nodes do not have network connectivity to the Internet. Pilot agents and payload jobs need to interact with external services from the compute nodes: access to the application software (CernVM-FS) and conditions data (Frontier), management of input and output data files (data management services), and job management (HTCondor). Finding an alternative route to these services is challenging. Seamless integration in the CMS production system without causing any operational overhead is a key goal. The case of the Barcelona Supercomputing Center (BSC), in Spain, is particularly challenging, due to its especially restrictive network setup. We describe in this paper the solutions developed within CMS to overcome these restrictions, and integrate this resource in production. Singularity containers with application software releases are built and pre-placed in the HPC facility shared file system, together with conditions data files. HTCondor has been extended to relay communications between running pilot jobs and HTCondor daemons through the HPC shared file system. This operation mode also allows piping input and output data files through the HPC file system. Results, issues encountered during the integration process, and remaining concerns are discussed.

show abstract

“…We deployed three independent machines at Fermilab as schedulers and two additional nodes to serve as a highly-available HTCondor central manager. Instances of the CMS-specific workload submission system WMAgent [27] were deployed on each scheduler. The WMAgent instances retrieved descriptions of work from a central service, created jobs, submitted them to their local batch HTCondor scheduler, tracked their completion, and resubmitted failed jobs when needed.…”

Section: Htcondor / Submission Poolmentioning

confidence: 99%

HEPCloud, a New Paradigm for HEP Facilities: CMS Amazon Web Services Investigation

Holzman¹,

Girone²,

Hufnagel³

et al. 2017

Comput Softw Big Sci

View full text Add to dashboard Cite

OverviewThe use of highly distributed systems for high-throughput computing has been very successful for the broad scientific computing community. Programs such as the Open Science Grid [1] allow scientists to gain efficiency by utilizing available cycles across different domains. Traditionally, these programs have aggregated resources owned at different institutes, adding the important functionality to elastically contract and expand resources to match instantaneous demand as desired. An appealing scenario is to extend the reach of extensible resources to the rental market of commercial clouds.A prototypical example of such a scientific domain is the field of High Energy Physics (HEP), which is strongly dependent on high-throughput computing. Every stage of a modern HEP experiment requires massive resources (compute, storage, networking). Detector and simulationgenerated data have to be processed and associated with auxiliary detector and beam information to generate physics objects, which are then stored and made available to the experimenters for analysis. In the current computing paradigm, the facilities that provide the necessary resources utilize distributed high-throughput computing, with global workflow, scheduling, and data management, enabled by high-performance networks. The computing resources in these facilities are either owned by an experiment and operated by laboratories and university partners (e.g. Energy Frontier experiments at the Large Hadron Collider (LHC) such as CMS, ATLAS) or deployed for a specific program, owned and operated by the host laboratory (e.g. Intensity Frontier experiments at Fermilab such as NOvA, MicroBooNE).The HEP investment to deploy and operate these resources is significant: for example, at the time of this work, Abstract Historically, high energy physics computing has been performed on large purpose-built computing systems. These began as single-site compute facilities, but have evolved into the distributed computing grids used today. Recently, there has been an exponential increase in the capacity and capability of commercial clouds. Cloud resources are highly virtualized and intended to be able to be flexibly deployed for a variety of computing tasks. There is a growing interest among the cloud providers to demonstrate the capability to perform large-scale scientific computing. In this paper, we discuss results from the CMS experiment using the Fermilab HEPCloud facility, which utilized both local Fermilab resources and virtual machines in the Amazon Web Services Elastic Compute Cloud. We discuss the planning, technical challenges, and lessons learned involved in performing physics workflows on a large-scale set of virtualized resources. In addition, we will discuss the economics and operational efficiencies when executing workflows both in the cloud and on dedicated resources.

show abstract

The CMS workload management system

Cited by 6 publications

References 5 publications

CMS Tier-0 data processing during the detector commissioning in Run 3

CMS Tier-0 data processing during the detector commissioning in Run 3

Exploitation of network-segregated CPU resources in CMS

HEPCloud, a New Paradigm for HEP Facilities: CMS Amazon Web Services Investigation

Contact Info

Product

Resources

About