CMS has started the process of rolling out a new workload management system. This system is currently used for reprocessing and monte carlo production with tests under way using it for user analysis. It was decided to combine, as much as possible, the production/processing, analysis and T0 codebases so as to reduce duplicated functionality and make best use of limited developer and testing resources. This system now includes central request submission and management (Request Manager); a task queue for parcelling up and distributing work (WorkQueue) and agents which process requests by interfacing with disparate batch and storage resources (WMAgent).
Scientific analysis and simulation requires the processing and generation of millions of data samples. These processing and generation tasks are often comprised of multiple smaller tasks divided over multiple (computing) sites. This paper discusses the Compact Muon Solenoid (CMS) workflow infrastructure, and specifically the Python based workflow library which is used for so called task lifecycle management. The CMS workflow infrastructure consists of three layers: high level specification of the various tasks based on input/output datasets, life cycle management of task instances derived from the high level specification and execution management. The workflow library is the result of a convergence of three CMS subprojects that respectively deal with scientific analysis, simulation and real time data aggregation from the experiment.
We present the development and first experience of a new component (termed WorkQueue) in the CMS workload management system. This component provides a link between a global request system (Request Manager) and agents (WMAgents) which process requests at compute and storage resources (known as sites). These requests typically consist of creation or processing of a data sample (possibly terabytes in size).Unlike the standard concept of a task queue, the WorkQueue does not contain fully resolved work units (known typically as jobs in HEP). This would require the WorkQueue to run computationally heavy algorithms that are better suited to run in the WMAgents. Instead the request specifies an algorithm that the WorkQueue uses to split the request into reasonable size chunks (known as elements). An advantage of performing lazy evaluation of an element is that expanding datasets can be accommodated by having job details resolved as late as possible.The WorkQueue architecture consists of a global WorkQueue which obtains requests from the request system, expands them and forms an element ordering based on the request priority. Each WMAgent contains a local WorkQueue which buffers work close to the agent, this overcomes temporary unavailability of the global WorkQueue and reduces latency for an agent to begin processing. Elements are pulled from the global WorkQueue to the local WorkQueue and into the WMAgent based on the estimate of the amount of work within the element and the resources available to the agent.WorkQueue is based on CouchDB, a document oriented no-sql database. WorkQueue uses the features of CouchDB (map/reduce views, bi-directional replication between distributed instances) to provide a scalable distributed system for managing large queues of work.The project described here represents an improvement over the old approach to workload management in CMS which involved individual operators feeding requests into agents. This new approach allows for a system where individual WMAgents are transient and can be added or removed from the system as needed. Presented at CHEP 2012: International Conference on Computing in High Energy and Nuclear PhysicsThe WorkQueue project -a task queue for the CMS workload management system S Ryu 1 , S Wakefield 2 1 Fermi National Laboratory, Batavia, Illinois, USA 2 Imperial College London, London, UK E-mail: stuart.wakefield@imperial.ac.ukAbstract. We present the development and first experience of a new component (termed WorkQueue) in the CMS workload management system. This component provides a link between a global request system (Request Manager) and agents (WMAgents) which process requests at compute and storage resources (known as sites). These requests typically consist of creation or processing of a data sample (possibly terabytes in size).Unlike the standard concept of a task queue, the WorkQueue does not contain fully resolved work units (known typically as jobs in HEP). This would require the WorkQueue to run computationally heavy algorithms that are better suit...
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.