Proceedings of the 2021 Workshop on Systems and Network Telemetry and Analytics (SNTA '21)
DOI: 10.1145/3452411.3464441

Analyzing Scientific Data Sharing Patterns for In-network Data Caching

Abstract: The volume of data moving through a network increases with new scientific experiments and simulations. Network bandwidth requirements also increase proportionally to deliver data within a certain time frame. We observe that a significant portion of popular datasets is transferred multiple times to different users as well as to the same user for various reasons. In-network data caching of shared data has been shown to reduce redundant data transfers and consequently save network traffic volume. In additi…
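As a rough illustration of the saving described in the abstract, the sketch below replays a hypothetical transfer log and counts how many bytes an idealized in-network cache would keep off the wide-area network. The file names, sizes, and log format are assumptions for illustration, not data from the paper.

    def redundant_traffic(transfers):
        """Return (total_bytes, bytes_saved) for an idealized shared cache."""
        seen = set()
        total = saved = 0
        for file_id, size in transfers:
            total += size
            if file_id in seen:
                saved += size      # repeat transfer: could be served from the cache
            else:
                seen.add(file_id)  # first transfer: must cross the wide-area network
        return total, saved

    # Hypothetical transfer log: (file path, size in bytes)
    log = [("run2/evt001.root", 2_000_000_000),
           ("run2/evt002.root", 1_500_000_000),
           ("run2/evt001.root", 2_000_000_000),   # re-read by another user
           ("run2/evt001.root", 2_000_000_000)]   # re-read by the same user

    total, saved = redundant_traffic(log)
    print(f"{saved / total:.0%} of the transferred volume was redundant")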

Cited by 5 publications (4 citation statements)
References 18 publications
“…The caching approach improves overall application performance by decreasing data access latency and increasing data access throughput. It also reduces traffic over the wide-area network by decreasing the number of repeated data transfers [10][11][12].…”
Section: Introduction
confidence: 99%
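The latency and throughput argument in the statement above can be made concrete with a small simulation. The LRU policy, cache capacity, access trace, and the 5 ms / 200 ms hit and miss latencies below are all assumed for illustration and do not come from the cited works.

    from collections import OrderedDict

    HIT_MS, MISS_MS = 5, 200          # assumed latencies: local hit vs wide-area fetch

    def simulate(accesses, capacity):
        """Replay an access trace through a small LRU cache."""
        cache, total_ms, hits = OrderedDict(), 0, 0
        for name in accesses:
            if name in cache:
                cache.move_to_end(name)           # LRU: mark as most recently used
                total_ms, hits = total_ms + HIT_MS, hits + 1
            else:
                total_ms += MISS_MS               # miss: fetch over the WAN
                cache[name] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)     # evict the least recently used file
        return hits / len(accesses), total_ms / len(accesses)

    trace = ["a", "b", "a", "c", "a", "b", "d", "a"]   # hypothetical access trace
    hit_ratio, avg_ms = simulate(trace, capacity=3)
    print(f"hit ratio {hit_ratio:.0%}, mean access latency {avg_ms:.0f} ms")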
“…To take advantage of this reuse, the High-Energy Physics (HEP) community has established a number of regional storage caches [6,7,13]. Analyses show that these caches could significantly reduce the data access latency as well as the traffic on the internet backbone [4].…”
Section: Introduction
confidence: 99%
“…Adding more cache nodes to an already full distributed cache invariably leads to skewed distributions of data access patterns. This happened around Aug. 26, 2021, when 7 new nodes at Caltech (xrd 3-8 and 11) were added to the system, and around Sep. 30, 2021, when 2 new nodes at Caltech (xrd 9-10) were added to the system. The new cache nodes get the new data.…”
Section: Introduction
confidence: 99%
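A toy placement model makes the skew described above easy to see: if files cached before an expansion were spread only over the original nodes, while files requested afterwards are hashed across the enlarged set, the newly added nodes end up holding only post-expansion data. The node names and the hash-modulo policy below are assumptions for illustration, not the production cache's placement logic.

    import hashlib

    def pick_node(path, nodes):
        """Deterministically map a file path to one cache node."""
        digest = hashlib.sha1(path.encode()).hexdigest()
        return nodes[int(digest, 16) % len(nodes)]

    old_nodes = ["xrd1", "xrd2"]                                # hypothetical pre-expansion nodes
    new_nodes = [f"xrd{i}" for i in range(3, 9)] + ["xrd11"]    # nodes added later

    contents = {n: {"old": 0, "new": 0} for n in old_nodes + new_nodes}
    for i in range(5_000):      # files already cached before the expansion
        contents[pick_node(f"old/file{i}", old_nodes)]["old"] += 1
    for i in range(5_000):      # files first requested after the expansion
        contents[pick_node(f"new/file{i}", old_nodes + new_nodes)]["new"] += 1

    for node, mix in contents.items():
        print(node, mix)        # the added nodes hold only post-expansion data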
“…As a first step, this can be set up independently on any machine that should cache data [138]. Furthermore, it can also be used to coordinate multiple caching nodes forming a federated system (a technology called XCache, commonly used in grid sites [139,140]). Although this caching system is common in HEP computing environments, it is rarely used directly by the analysis framework; it is usually activated at the level of the grid site instead.…”
Section: State of the Art
confidence: 99%
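The two deployment modes mentioned in the statement above, an independent cache on a single machine versus a federated set of caching nodes behind a redirector, can be sketched with a minimal model. This is an illustration of the architecture only, not the XRootD/XCache implementation, and the class and node names are hypothetical.

    import hashlib

    class CacheNode:
        """One caching machine: serves local copies, pulls misses from the origin."""
        def __init__(self, name, origin):
            self.name, self.origin, self.store = name, origin, {}

        def read(self, path):
            if path not in self.store:            # miss: fetch from the origin server
                self.store[path] = self.origin[path]
            return self.store[path]

    class Federation:
        """Redirector that partitions the namespace across several cache nodes."""
        def __init__(self, nodes):
            self.nodes = nodes

        def read(self, path):
            idx = int(hashlib.sha1(path.encode()).hexdigest(), 16) % len(self.nodes)
            return self.nodes[idx].read(path)

    origin = {"/store/data/a.root": b"payload-a", "/store/data/b.root": b"payload-b"}
    standalone = CacheNode("edge-cache", origin)                       # per-machine cache
    federated = Federation([CacheNode(f"xcache{i}", origin) for i in range(3)])
    print(standalone.read("/store/data/a.root") == federated.read("/store/data/a.root"))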