2016
DOI: 10.1088/1742-6596/762/1/012048

Predicting dataset popularity for the CMS experiment

Abstract: The CMS experiment at the LHC accelerator at CERN relies on its computing infrastructure to stay at the frontier of High Energy Physics, searching for new phenomena and making discoveries. Even though computing plays a significant role in physics analysis, we rarely use its data to predict the behavior of the system itself. Basic information about computing resources, user activities and site utilization can be very useful for improving the throughput of the system and its management. In this paper, we discuss a f…

Cited by 15 publications (14 citation statements)
References 7 publications
“…For reference: PR curves plot precision versus recall, where precision = TP/(TP+FP) and recall = TP/(TP+FN); ROC curves plot FPR vs TPR, with FPR = FP/(FP+TN), and TPR = TP/(TP+FN). The main result of this part of the work is that the Spark Machine Learning code has been developed and tested with the dataset popularity data sets and shows results that are comparable to those obtained previously using scikit-learn [9,10]. It was demonstrated that a Spark platform can be used with no loss whatsoever in the quality of the model and its prediction, and in addition it allows to deal more efficiently with much larger dataframe (i.e.…”
Section: Cms Framework For Machine Learning Studies (supporting)
confidence: 53%
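The metric definitions quoted in the statement above can be illustrated with a minimal Python sketch. It is not taken from the cited papers: the `ConfusionCounts` class, the helper names, and the example counts are hypothetical and chosen only for illustration. TNR = TN/(TN+FP) is not defined in the quoted text; it is added here as the standard complement of FPR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical container for binary confusion-matrix counts."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a class is absent from the sample.
    return numerator / denominator if denominator else float("nan")


def precision(c: ConfusionCounts) -> float:
    # precision = TP / (TP + FP)
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: ConfusionCounts) -> float:
    # recall = TPR = TP / (TP + FN)
    return _ratio(c.tp, c.tp + c.fn)


def false_positive_rate(c: ConfusionCounts) -> float:
    # FPR = FP / (FP + TN)
    return _ratio(c.fp, c.fp + c.tn)


def true_negative_rate(c: ConfusionCounts) -> float:
    # TNR = TN / (TN + FP) = 1 - FPR (standard definition, not in the quote)
    return _ratio(c.tn, c.tn + c.fp)


# Made-up counts for illustration only; they are not results from any paper.
example = ConfusionCounts(tp=90, fp=5, tn=95, fn=10)
```

With these made-up counts, `recall(example)` is 90/100 and `false_positive_rate(example)` is 5/100. One point on a PR curve is the pair (recall, precision), and one point on a ROC curve is the pair (FPR, TPR), each computed at a single classification threshold.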
“…Widely known classifiers like XGBClassifier and RandomForestClassifier performed in the model at the level of a TPR (True Positive Rate) of 90% or more, and a TNR (True Negative Rate) of around 95% [9]. The study also showed at least a factor of 2.5 of difference in favour of RandomForestClassifier in the classifier running time [10].…”
Section: Cms Framework For Machine Learning Studies (mentioning)
confidence: 95%
“…Today machine learning (ML) is becoming ubiquitous in HEP applications, most notably in final offline analyses, but it is also increasingly used in both online and offline reconstruction and particle identification algorithms, in the classification of reconstruction-level objects, such as jets, as well as in sectors that did not exploit ML up to 2-3 years ago, like the analysis of computing metadata for resource usage optimization [1,2]. It is difficult to predict how this will evolve in HEP.…”
Section: Introduction (mentioning)
confidence: 99%