Predicting dataset popularity for the CMS experiment

Kuznetsov, Valentin; Li, Ting; Giommi, L.; Bonacorsi, D.; Wildish, T.

doi:10.1088/1742-6596/762/1/012048

Cited by 15 publications

(14 citation statements)

References 7 publications


“…For reference: PR curves plot precision versus recall, where precision = TP/(TP+FP) and recall = TP/(TP+FN); ROC curves plot FPR vs TPR, with FPR = FP/(FP+TN) and TPR = TP/(TP+FN). The main result of this part of the work is that the Spark Machine Learning code has been developed and tested with the dataset popularity data sets and shows results that are comparable to those obtained previously using scikit-learn [9,10]. It was demonstrated that a Spark platform can be used with no loss whatsoever in the quality of the model and its prediction, and in addition it allows much larger dataframes to be handled more efficiently (i.e.…”

Section: CMS Framework for Machine Learning Studies (supporting)

confidence: 53%
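
The definitions quoted in this statement can be checked with a short calculation. The following is a minimal sketch in plain Python; the `Confusion` class, the helper names, and the example counts are hypothetical illustrations and are not taken from the cited papers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from a binary confusion matrix (hypothetical container)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(num: int, den: int) -> float:
    # Return 0.0 when the denominator is zero instead of raising.
    return num / den if den else 0.0


def precision(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fp)  # TP / (TP + FP)


def recall(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fn)  # TP / (TP + FN), identical to TPR


def fpr(c: Confusion) -> float:
    return _ratio(c.fp, c.fp + c.tn)  # FP / (FP + TN)


def tnr(c: Confusion) -> float:
    return _ratio(c.tn, c.tn + c.fp)  # TN / (TN + FP), equal to 1 - FPR


# Made-up counts, chosen only to illustrate the formulas.
example = Confusion(tp=90, fp=5, tn=95, fn=10)
print(precision(example), recall(example), fpr(example), tnr(example))
```

Note that recall and TPR are the same quantity, so a PR curve and a ROC curve share one axis and differ only in whether precision or FPR is plotted against it.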

“…Widely known classifiers like XGBClassifier and RandomForestClassifier performed in the model at the level of a TPR (True Positive Rate) of 90% or more, and a TNR (True Negative Rate) of around 95% [9]. The study also showed a difference of at least a factor of 2.5 in favour of RandomForestClassifier in the classifier running time [10].…”

Section: CMS Framework for Machine Learning Studies (mentioning)

confidence: 95%

“…number of accesses to a dataset, number of users accessing it, CPU-hrs spent in accessing it, etc. A proper definition of the popularity concept in the context of preparing for the application of ML techniques has been the subject of previous work [8,9]. The ability to predict the popularity of a CMS dataset is crucial as it makes it possible to optimise the storage utilisation.…”

Section: CMS Framework for Machine Learning Studies (mentioning)

confidence: 99%

“…Using the DCAF framework, it may take a couple of days to collect up to 12 months of data. Dataframes from 6, 9 and 12 months in the past were used in a test to train the classifiers, and a dedicated study demonstrated that 6 months is a sufficiently large time window in terms of quality of the model built with this data: extending to a larger window would have increased the risk of unwanted overfitting effects [9,10]. As a consequence, the model was trained on 6 months' worth of data to predict the popularity of CMS datasets in the upcoming week; then, this week was added to the training set and the first week in the past was dropped (so that the time window does indeed "slide"); then, the classifiers were re-trained and the predictions were extracted for the following week; and so on.…”

Section: CMS Framework for Machine Learning Studies (mentioning)

confidence: 99%
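
The sliding-window procedure described in this statement can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `sliding_windows`, the week labels, and the choice of 26 weeks as a stand-in for "6 months" are hypothetical and do not come from the cited work or from the DCAF code.

```python
from typing import Iterator, List, Tuple


def sliding_windows(
    weeks: List[str], train_len: int
) -> Iterator[Tuple[List[str], str]]:
    """Yield (training weeks, week to predict) pairs.

    Each step trains on `train_len` consecutive weeks and predicts the
    week that follows; the window then moves forward by one week, so the
    oldest week is dropped and the newly observed week is added.
    """
    if train_len <= 0:
        raise ValueError("train_len must be positive")
    for start in range(len(weeks) - train_len):
        train = weeks[start:start + train_len]
        target = weeks[start + train_len]
        yield train, target


# Assumption: roughly 26 weekly dataframes approximate a 6-month window.
all_weeks = [f"w{i}" for i in range(30)]
for train, target in sliding_windows(all_weeks, train_len=26):
    # A real study would fit a classifier on `train` and score `target` here.
    print(train[0], "to", train[-1], "predicts", target)
```

Keeping the window length fixed means every re-training sees the same amount of history, which matches the statement's point that a longer window was not found to improve model quality.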


Progress in Machine Learning Studies for the CMS Computing Infrastructure

Bonacorsi, Kuznetsov, Magini, et al. (2017)

Proceedings of International Symposium on Grids and Clouds (ISGC) 2017 — PoS(ISGC2017)



“…Today machine learning (ML) is becoming ubiquitous in HEP applications, most notably in final offline analyses, but it is also increasingly used in both online and offline reconstruction and particle identification algorithms, in the classification of reconstruction-level objects, such as jets, as well as in sectors that did not exploit ML up to 2-3 years ago, like the analysis of computing metadata for resource usage optimization [1,2]. It is difficult to predict how this will evolve in HEP.…”

Section: Introduction (mentioning)

confidence: 99%

Prototype of Machine Learning “as a Service” for CMS Physics in Signal vs Background discrimination

Giommi, Bonacorsi, Kuznetsov (2018)

Proceedings of Sixth Annual Conference on Large Hadron Collider Physics — PoS(LHCP2018)


Large volumes of data are collected and analyzed by the LHC experiments at CERN. The success of these scientific challenges is ensured by a great amount of computing power and storage capacity, operated over high-performance networks, in very complex computing models on the LHC computing grid infrastructure. Now in Run-2 data taking, the LHC has an ambitious and broad experimental programme for the coming decades: it includes large investments in detector hardware, and it requires a commensurate investment in software and computing R&D to acquire, manage, process and analyze the sheer amounts of data to be recorded in the high-luminosity LHC (HL-LHC) era. The new rise of artificial intelligence (related to the current big data era, to technological progress, and to a jump in the democratization of resources and their efficient allocation at affordable costs through cloud solutions) is posing new challenges but also offering extremely promising techniques, not only for the commercial world but also for scientific enterprises such as HEP experiments. Machine learning and deep learning are rapidly evolving approaches to characterising and describing data, with the potential to radically change how data is reduced and analyzed, also at the LHC. This work aims at contributing to the construction of a machine learning "as a service" solution for CMS physics needs, namely an end-to-end data service to serve trained machine learning models to the CMS software framework. Towards this ambitious goal, this work contributes firstly a proof of concept of a first prototype of such an infrastructure, and secondly a specific physics use case: signal versus background discrimination in the study of CMS all-hadronic top quark decays, done with scalable machine learning techniques.
