Proceedings of the 2nd Workshop on Parallel Programming for Analytics Applications 2015
DOI: 10.1145/2726935.2726944

Monitoring HPC applications in the production environment

Abstract: The advancement of HPC systems brings with it a need for more introspection into the run-time environment and performance of long-running applications. Software and hardware fault tolerance, scaling performance issues, soft error effect on computations, and even large-scale computational progress will require more capable runtime monitoring of applications during production runs. Current HPC toolsets, however, are geared towards heavyweight program introspection during the development, debugging, and optimizati…

Citations: cited by 5 publications (1 citation statement)
References: 20 publications (18 reference statements)
“…These situations can result in a significant waste of resources as the SMS is unaware of the problem, and thus cannot terminate the job. Various watchdog methods have been developed for detecting this situation, including requiring a periodic "heartbeat" from the application and monitoring a specified file for changes in size [15,18].…”
Section: Job Control and Monitoring
Citation type: mentioning
confidence: 99%
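
The file-size watchdog mentioned in the citation statement above could look roughly like the following sketch. It is not taken from the cited works: the class name `FileGrowthWatchdog`, the `timeout_s` parameter, and the injectable `clock` are all hypothetical, and the code assumes a healthy job keeps changing the size of one known output file.

```python
import os
import time


class FileGrowthWatchdog:
    """Hypothetical watchdog: reports a stall when a watched file's size
    has not changed for at least `timeout_s` seconds."""

    def __init__(self, path, timeout_s, clock=time.monotonic):
        self.path = path
        self.timeout_s = timeout_s
        self._clock = clock              # injectable so the logic can be tested
        self._last_size = self._size()
        self._last_change = clock()

    def _size(self):
        # A missing or unreadable file is treated as size -1, so its later
        # appearance still counts as a change.
        try:
            return os.path.getsize(self.path)
        except OSError:
            return -1

    def stalled(self):
        """Return True if the file size has been unchanged for timeout_s."""
        size = self._size()
        now = self._clock()
        if size != self._last_size:
            # Progress observed: remember the new size and restart the timer.
            self._last_size = size
            self._last_change = now
            return False
        return (now - self._last_change) >= self.timeout_s
```

A monitoring process would call `stalled()` periodically and notify the scheduler when it returns True. A heartbeat variant could follow the same pattern by having the application rewrite or append to the watched file at a fixed interval.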