2021
DOI: 10.1051/epjconf/202125102061
Coffea-casa: an analysis facility prototype

Abstract: Data analysis in HEP has often relied on batch systems and event loops; users are given a non-interactive interface to computing resources and consider data event-by-event. The “Coffea-casa” prototype analysis facility is an effort to provide users with alternate mechanisms to access computing resources and enable new programming paradigms. Instead of the command-line interface and asynchronous batch access, a notebook-based web interface and interactive computing is provided. Instead of writing event loops, t…
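The abstract's contrast between event loops and the newer paradigm can be sketched in a few lines. This is an illustrative example only (not code from the paper), using NumPy arrays as stand-ins for columns of event data; the column name and cut value are invented:

```python
# Illustrative sketch: event-loop style vs. columnar style selection.
# NumPy arrays stand in for columns of event data; "pt" and the 25 GeV
# threshold are hypothetical, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
pt = rng.exponential(scale=30.0, size=1_000)  # hypothetical transverse-momentum column

# Event-loop style: visit each event one at a time.
selected_loop = []
for value in pt:
    if value > 25.0:
        selected_loop.append(value)

# Columnar style: one vectorized operation over the whole column at once.
selected_columnar = pt[pt > 25.0]

assert len(selected_loop) == len(selected_columnar)
```

In the columnar paradigm, the per-event `if` disappears into a single array expression, which is what tools like coffea and Awkward Array generalize to jagged HEP data.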

Cited by 7 publications (7 citation statements)
References 18 publications
“…Data delivery is less generic, in that HEP datasets have specialized formats, considerable tooling, and optimizable properties, such as statistically independent events and the columnar layouts of TTrees. Three IRIS-HEP projects, namely ServiceX [31], SkyhookDM [32], and coffea-casa [40], use generic data science tools to build HEP-specific workflows. These are good examples of the "mixed future," in which Docker, Kubernetes, Helm, Minio, Flask, RabbitMQ, Kafka, Ceph, and Gandiva are used alongside ROOT, Rucio, XCache, and Uproot to deliver columns of data to analyses as Arrow or Awkward Array buffers, Parquet or ROOT files.…”
Section: Distributed Computing
confidence: 99%
“…It is being packaged in a way that it can be deployed on clusters outside of Nebraska. Further explanation of the concepts and demonstrations of the facility can be found in a paper for the CHEP 2021 conference [21].…”
Section: University Of Nebraska
confidence: 99%
“…Usage of Dask has begun only recently, also brought by the increased popularity in HEP of Python-based interfaces. In particular, it is being explored in the context of the so-called analysis facilities, where different tools are unified in a coherent software stack that can fulfill all of physicists' analysis needs [32]. In this regard, a key feature of Dask is provided by its interfaces with batch computing systems, in particular HTCondor, widely used in HEP computing clusters.…”
Section: Related Work
confidence: 99%
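The Dask usage described above can be sketched minimally. This assumes only the `dask` package; Dask builds a task graph lazily and then executes it on a scheduler, and on a facility the same graph would be dispatched to HTCondor workers (e.g. via dask-jobqueue's `HTCondorCluster`) rather than the local threads used here:

```python
# Minimal sketch of Dask's lazy task-graph model (assumes `dask` is installed).
# On an analysis facility the graph would run on HTCondor workers via
# dask-jobqueue's HTCondorCluster; here the default local scheduler is used.
import dask


@dask.delayed
def square(x):
    # Each call becomes a node in the task graph, not an immediate computation.
    return x * x


# Build a small graph: the sum of squares of 0..3. Nothing runs yet.
total = dask.delayed(sum)([square(i) for i in range(4)])

# Executing the graph: 0 + 1 + 4 + 9 = 14.
print(total.compute())  # 14
```

Swapping the scheduler, not the analysis code, is the point: the same `total.compute()` works unchanged whether the workers are local threads or batch jobs on a cluster.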