The Worldwide LHC Computing Grid (WLCG) is an innovative distributed environment which is deployed through the use of grid computing technologies in order to provide computing and storage resources to the LHC experiments for data processing and physics analysis. Following the increasing demands of LHC computing needs toward the high-luminosity era, the experiments are engaged in an ambitious programme to extend the capability of the WLCG distributed environment, for instance by including opportunistically used resources such as High-Performance Computers (HPCs), cloud platforms and volunteer computers. In order to be used effectively by the LHC experiments, all these diverse distributed resources should be described in detail. This implies easy service discovery of shared physical resources, a detailed description of service configurations, and experiment-specific data structures. In this contribution, we present a high-level information component of a distributed computing environment, the Computing Resource Information Catalogue (CRIC), which aims to facilitate distributed computing operations for the LHC experiments and consolidate WLCG topology information. In addition, CRIC performs data validation and provides a coherent view and topology description to the LHC VOs for service discovery and configuration. CRIC represents the evolution of the ATLAS Grid Information System (AGIS) into a common, experiment-independent high-level information framework. CRIC’s mission is to serve not just the ATLAS Collaboration’s needs for the description of the distributed environment, but those of any other virtual organization relying on large-scale distributed infrastructure, as well as the WLCG on the global scope. The contribution describes the CRIC architecture, the implementation of the data model, collectors, user interfaces, and the advanced authentication and access control components of the system.
CRIC is a high-level information system which provides a flexible, reliable and complete topology and configuration description for a large-scale distributed heterogeneous computing infrastructure. CRIC aims to facilitate distributed computing operations for the LHC experiments and consolidate WLCG topology information. It aggregates information coming from various low-level information sources and complements the topology description with experiment-specific data structures and settings required by the LHC VOs in order to exploit computing resources. Being an experiment-oriented but still experiment-independent information middleware, CRIC offers a generic solution, in the form of a suitable framework with appropriate interfaces implemented, which can be successfully applied on the global WLCG level or at the level of a particular LHC experiment. For example, there are CRIC instances for CMS [11] and ATLAS [10]. CRIC can even be used for a special task: for example, a dedicated CRIC instance has been built to support transfer tests performed by the DOMA Third Party Copy working group. Moreover, the extensibility and flexibility of the system allow CRIC to follow technology evolution and easily implement concepts required to describe new types of computing and storage resources. The contribution describes the overall CRIC architecture, the plug-in based implementation of the CRIC components, as well as recent developments and future plans.
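The plug-in based design can be illustrated with a minimal Python sketch (class and source names here are hypothetical and not taken from the CRIC code base): each low-level information source is wrapped in a collector plug-in that exposes a common interface, and the core aggregates their output into a per-site topology description.

# Minimal sketch of a plug-in based collector layer (hypothetical names,
# not taken from the actual CRIC implementation).
from abc import ABC, abstractmethod
from typing import Dict, List


class Collector(ABC):
    """Common interface every information-source plug-in implements."""

    @abstractmethod
    def fetch(self) -> List[Dict]:
        """Return a list of resource descriptions from the low-level source."""


class GocdbCollector(Collector):
    def fetch(self) -> List[Dict]:
        # In a real plug-in this would query the GOCDB REST API.
        return [{"site": "EXAMPLE-SITE", "service": "SRM", "endpoint": "srm://..."}]


class ExperimentOverrideCollector(Collector):
    def fetch(self) -> List[Dict]:
        # Experiment-specific settings layered on top of the shared topology.
        return [{"site": "EXAMPLE-SITE", "queue": "analysis", "max_jobs": 5000}]


def build_topology(collectors: List[Collector]) -> Dict[str, List[Dict]]:
    """Aggregate all plug-in outputs into a single per-site description."""
    topology: Dict[str, List[Dict]] = {}
    for collector in collectors:
        for record in collector.fetch():
            topology.setdefault(record["site"], []).append(record)
    return topology


if __name__ == "__main__":
    print(build_topology([GocdbCollector(), ExperimentOverrideCollector()]))

New resource types can then be supported by adding another collector plug-in, without touching the aggregation core.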
The Compact Muon Solenoid (CMS) experiment heavily relies on the CMSWEB cluster to host critical services for its operational needs. The cluster is deployed on virtual machines (VMs) from the CERN OpenStack cloud and is manually maintained by operators and developers. The release cycle is composed of several steps, from building RPMs to their deployment, validation, and integration tests. To enhance the sustainability of the CMSWEB cluster, CMS decided to migrate its cluster to a containerized solution based on Docker and orchestrated with Kubernetes (K8s). This allows us to significantly speed up the release upgrade cycle, follow the end-to-end deployment procedure, and reduce operational cost. In this paper, we give an overview of the CMSWEB VM cluster and the issues we discovered during this migration. We discuss the architecture and the implementation strategy of the CMSWEB Kubernetes cluster. Even though Kubernetes provides horizontal pod autoscaling based on CPU and memory, in this paper we provide details of horizontal pod autoscaling based on the custom metrics of CMSWEB services. We also discuss an automated deployment procedure based on the best practices of continuous integration/continuous deployment (CI/CD) workflows. We present a performance analysis of the Kubernetes and VM-based CMSWEB deployments. Finally, we describe various issues found during the implementation in Kubernetes and report on lessons learned during the migration process.
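A minimal sketch of what such a custom-metric autoscaling policy could look like, expressed with the Kubernetes Python client (the namespace, deployment name, metric name and thresholds are illustrative assumptions, not the actual CMSWEB configuration):

# Hypothetical HorizontalPodAutoscaler driven by a custom per-pod metric.
# The custom metric must be made visible to the autoscaler through a
# metrics adapter (e.g. prometheus-adapter); that part is configured separately.
from kubernetes import client, config, utils

hpa_manifest = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "example-service-hpa", "namespace": "example"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "example-service",
        },
        "minReplicas": 2,
        "maxReplicas": 10,
        "metrics": [
            {
                "type": "Pods",
                "pods": {
                    # Custom metric exposed by the service itself.
                    "metric": {"name": "http_requests_per_second"},
                    "target": {"type": "AverageValue", "averageValue": "100"},
                },
            }
        ],
    },
}

if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() when run inside a pod
    utils.create_from_dict(client.ApiClient(), hpa_manifest)

With such a policy, the controller scales the deployment between 2 and 10 replicas to keep the average of the chosen custom metric near the target value.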
The CMS experiment heavily relies on the CMSWEB cluster to host critical services for its operational needs. The cluster is deployed on virtual machines (VMs) from the CERN OpenStack cloud and is manually maintained by operators and developers. The release cycle is composed of several steps, from building RPMs to their deployment, validation, and integration tests. To enhance the sustainability of the CMSWEB cluster, CMS decided to migrate its cluster to a containerized solution such as Docker, orchestrated with Kubernetes (k8s). This allows us to significantly shorten the release upgrade cycle, follow the end-to-end deployment procedure, and reduce operational cost. This paper gives an overview of the current CMSWEB cluster and its issues. We describe the new architecture of the CMSWEB cluster in Kubernetes. We also provide a comparison of the VM and Kubernetes deployment approaches and report on lessons learned during the migration process.
In the near future, large scientific collaborations will face unprecedented computing challenges. Processing and storing exabyte datasets require a federated infrastructure of distributed computing resources. The current systems have proven to be mature and capable of meeting the experiment goals, by allowing timely delivery of scientific results. However, a substantial number of interventions from software developers, shifters and operational teams is needed to efficiently manage such heterogeneous infrastructures. A wealth of operational data can be exploited to increase the level of automation in computing operations by using adequate techniques, such as machine learning (ML), tailored to solve specific problems. The Operational Intelligence project is a joint effort from various WLCG communities aimed at increasing the level of automation in computing operations. We discuss how state-of-the-art technologies can be used to build general solutions to common problems and to reduce the operational cost of the experiment computing infrastructure.
This paper summarizes the various storage options that we implemented for the CMSWEB cluster in the Kubernetes infrastructure. All CMSWEB services require storage for logs, while some services also require storage for data. We also provide a feasibility analysis of the various storage options and describe the pros and cons of each technique from the perspective of the CMSWEB cluster and its users. Finally, we propose recommendations according to the service needs. The first option is CephFS, which can be mounted multiple times across various clusters and VMs and works very well with k8s. We use it for both data and logs. The second option is the Cinder volume. It is block storage with a filesystem running on top of it, and it can only be attached to one instance at a time. We use this option only for data. The third option is S3 storage. It is object storage that offers a scalable storage service and can be used by applications compatible with the Amazon S3 protocol. It is used for logs. For S3, we explored two mechanisms. In the first scenario, fluentd runs as a sidecar container in the service pods and sends logs to an S3 bucket. In the second scenario, filebeat runs as a sidecar container in the service pod and ships the logs to fluentd, which runs as a DaemonSet on each node and finally sends them to S3. The fourth option is EOS. We configured EOS inside the pods of the CMSWEB services. The fifth option that we explored is to use dedicated VMs with a Ceph volume attached to them. For the EOS and dedicated-VM options, the logs from the service pods are sent to EOS or to the VM using the rsync approach. The last option is to send service logs to Elasticsearch. It has been implemented using fluentd, which runs as a DaemonSet on each node. In parallel to sending the logs to S3, fluentd also sends them to the Elasticsearch infrastructure at CERN.
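As an illustration of the sidecar approach, here is a minimal Python sketch of a pod manifest in which the service writes its log files to a shared emptyDir volume and a fluentd sidecar tails the same volume and forwards the logs; the image names, mount paths and the S3 destination are illustrative assumptions, not the actual CMSWEB configuration.

# Hypothetical sketch of a service pod with a fluentd log-shipping sidecar.
import json

pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example-service", "namespace": "example"},
    "spec": {
        # Shared scratch volume: the service writes logs here, the sidecar reads them.
        "volumes": [{"name": "logs", "emptyDir": {}}],
        "containers": [
            {
                "name": "service",
                "image": "registry.example.org/example-service:latest",
                "volumeMounts": [{"name": "logs", "mountPath": "/data/logs"}],
            },
            {
                "name": "fluentd",
                "image": "fluent/fluentd:latest",
                # The sidecar tails the shared volume; forwarding to an S3
                # bucket would be done via the fluentd s3 output plugin,
                # configured through a ConfigMap not shown here.
                "volumeMounts": [
                    {"name": "logs", "mountPath": "/data/logs", "readOnly": True}
                ],
            },
        ],
    },
}

if __name__ == "__main__":
    # Print the manifest; in practice it would be applied with kubectl
    # or the Kubernetes Python client.
    print(json.dumps(pod_manifest, indent=2))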
As a joint effort from various communities involved in the Worldwide LHC Computing Grid, the Operational Intelligence project aims at increasing the level of automation in computing operations and reducing human interventions. The distributed computing systems currently deployed by the LHC experiments have proven to be mature and capable of meeting the experimental goals, by allowing timely delivery of scientific results. However, a substantial number of interventions from software developers, shifters, and operational teams is needed to efficiently manage such heterogeneous infrastructures. Under the scope of the Operational Intelligence project, experts from several areas have gathered to propose and work on “smart” solutions. Machine learning, data mining, log analysis, and anomaly detection are only some of the tools we have evaluated for our use cases. In this community study contribution, we report on the development of a suite of operational intelligence services to cover various use cases: workload management, data management, and site operations.
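To give a concrete flavour of this kind of tool, the following Python sketch applies a standard anomaly-detection technique (scikit-learn's IsolationForest) to toy per-interval operational features; the feature choice and data are made up for illustration and do not come from the project.

# Illustrative anomaly detection on made-up operational features.
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy feature matrix: one row per monitoring interval, columns e.g.
# [failed_transfers, mean_transfer_time_s, queued_jobs].
normal = np.random.default_rng(0).normal(
    loc=[5, 120, 1000], scale=[2, 10, 100], size=(500, 3)
)
anomalous = np.array([[80, 400, 6000], [60, 350, 5500]])
X = np.vstack([normal, anomalous])

# Fit on intervals assumed to be healthy, then score everything.
model = IsolationForest(contamination=0.01, random_state=0).fit(normal)
labels = model.predict(X)  # +1 = normal, -1 = anomaly

print("flagged intervals:", np.where(labels == -1)[0])

In an operational setting, flagged intervals would be surfaced to shifters or trigger automated follow-up actions instead of being printed.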
The Compact Muon Solenoid (CMS) experiment heavily relies on the CMSWEB cluster to host critical services for its operational needs. Recently, the CMSWEB cluster has been migrated from a VM cluster to a Kubernetes (k8s) cluster. The new CMSWEB cluster in Kubernetes enhances sustainability and reduces the operational cost. In this work, we added new features to the CMSWEB k8s cluster. The new features include the deployment of services using Helm chart templates and the incorporation of canary releases using NGINX ingress weighted routing, which routes traffic to multiple versions of a service simultaneously. The usage of Helm simplifies the deployment procedure, and no Kubernetes expertise is needed anymore for service deployment. Helm packages all dependencies, and services are easily deployed, updated and rolled back. Helm enables us to deploy multiple versions of a service to run simultaneously. This feature is very useful for developers to test new versions of the services by assigning some weight to the new service version and rolling back immediately in case of issues. Using Helm, we can also deploy different application configurations at runtime.
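A minimal sketch of the weighted-routing idea, expressed as a canary Ingress manifest built in Python (the host, service name, port and the 10% weight are illustrative assumptions, not the actual CMSWEB setup): the NGINX ingress controller's canary annotations direct the chosen fraction of traffic to the new service version, while the existing production Ingress for the same host keeps serving the rest.

# Hypothetical canary Ingress using NGINX ingress weighted routing.
import json

canary_ingress = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "Ingress",
    "metadata": {
        "name": "example-service-canary",
        "namespace": "example",
        "annotations": {
            # Mark this Ingress as the canary for an existing production
            # Ingress serving the same host/path, and send 10% of traffic to it.
            "nginx.ingress.kubernetes.io/canary": "true",
            "nginx.ingress.kubernetes.io/canary-weight": "10",
        },
    },
    "spec": {
        "rules": [
            {
                "host": "example-service.example.org",
                "http": {
                    "paths": [
                        {
                            "path": "/",
                            "pathType": "Prefix",
                            "backend": {
                                "service": {
                                    "name": "example-service-new",
                                    "port": {"number": 8080},
                                }
                            },
                        }
                    ]
                },
            }
        ]
    },
}

if __name__ == "__main__":
    # Print the manifest; in a Helm-based workflow the same object would be
    # rendered from a chart template and applied during the release.
    print(json.dumps(canary_ingress, indent=2))

Raising the weight gradually promotes the new version, and deleting the canary Ingress (or setting the weight to 0) rolls traffic back immediately.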