Dmitry Grin scite author profile

Dmitry Grin

4 Publications

5 Citation Statements Received

7 Citation Statements Given

Affiliations

Kurchatov Institute

Publications

Clustering error messages produced by distributed computing infrastructure during the processing of high energy physics data

Grigorieva

Grin

2021

Int. J. Mod. Phys. A

Large-scale distributed computing infrastructures ensure the operation and maintenance of scientific experiments at the LHC: more than 160 computing centers all over the world execute tens of millions of computing jobs per day. ATLAS, the largest experiment at the LHC, creates an enormous flow of data which has to be recorded and analyzed by a complex, heterogeneous, and distributed computing environment. Statistically, about 10–12% of computing jobs end with a failure: network faults, service failures, authorization failures, and other error conditions trigger error messages that provide detailed information about the issue and can be used for diagnosis and proactive fault handling. However, this analysis is complicated by the sheer scale of textual log data, and often exacerbated by the lack of a well-defined structure: human experts have to interpret the detected messages and create parsing rules manually, which is time-consuming and does not allow previously unknown error conditions to be identified without further human intervention. This paper describes a pipeline of methods for the unsupervised clustering of multi-source error messages. The pipeline is data-driven, based on machine learning algorithms, and executed fully automatically, so that error messages can be categorized according to textual patterns and meaning.

Transformer-Based Model for the Semantic Parsing of Error Messages in Distributed Computing Systems in High Energy Physics

Grin

Grigorieva

2021

Large-scale computing centers supporting modern scientific experiments store and analyze vast amounts of data. A noticeable number of computing jobs executed within the complex distributed computing environments end with errors of some kind, and the amount of error log data generated every day complicates manual analysis by human experts. Moreover, traditional methods such as specifying regular expression patterns to automatically group error messages become impractical in a heterogeneous computing environment without a well-defined structure of error messages. The ClusterLogs framework for error message clustering was developed to address this challenge. The framework can discover common patterns in error messages from various sources and group them together. One of the essential results of this process is a clear automated description of the resulting clusters, which will be used for the analysis. In this research, we propose that interpreting error messages as a natural language allows us to use transformer-based deep learning models such as BERT for this task. A model for extracting the relevant part of messages was trained and integrated into ClusterLogs to represent each cluster as a few actionable items, ensuring better interpretation and validation of the clustering results.

Enhancements in Functionality of the Interactive Visual Explorer for ATLAS Computing Metadata

et al. 2020

The development of the Interactive Visual Explorer (InVEx), a visual analytics tool for the computing metadata of the ATLAS experiment at the LHC, includes research into various approaches to data handling on both the server and client sides. InVEx is implemented as a web-based application that aims to enhance the analytical and visualization capabilities of the existing monitoring tools and to facilitate data analysis through interactivity and human supervision. The current work focuses on architectural enhancements of the InVEx application. First, we describe the user-manageable data preparation stage for cluster analysis. Then, we present the Level-of-Detail approach for interactive visual analysis. It starts at a low level of detail, where all data records are grouped (by clustering algorithms or by categories) and aggregated. We provide users with the means to look deeper into this data, incrementally increasing the level of detail. Finally, we demonstrate the development of a data storage backend for InVEx, which is adapted to the Level-of-Detail method so that all stages of the data derivation sequence are kept.

Visual Analysis Application for the Error Messages Clustering Framework

Grin

Grigorieva

Artamonov

2021

Procedia Computer Science
