Petascale machines face two known challenges: (1) the costs of I/O for high performance applications can be substantial, especially for output tasks like checkpointing, and (2) noise from I/O actions can inject undesirable delays into the runtimes of such codes on individual compute nodes. This paper introduces the flexible 'DataStager' framework for data staging, along with alternative services within it that jointly address (1) and (2). Data staging services, which move output data from compute nodes to staging or I/O nodes prior to storage, reduce the I/O overheads on applications' total processing times, and explicit management of data staging reduces perturbation when extracting output data from a petascale machine's compute partition. Experimental evaluations of DataStager on the Cray XT machine at Oak Ridge National Laboratory establish both the necessity of intelligent data staging and the high performance of our approach, using the GTC fusion modeling code and benchmarks running on 1000+ processors.
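The staging idea described above can be illustrated with a minimal sketch. It is not the DataStager API: the `StagingBuffer` class, the `run_simulation` function, and the in-memory `stored` list are hypothetical names, and a background thread stands in for a staging node. The point is only that the compute loop hands off its output and continues, while the slow write happens elsewhere.

```python
import queue
import threading


class StagingBuffer:
    """Toy stand-in for a staging node (hypothetical, not DataStager's API)."""

    def __init__(self, sink, capacity=8):
        self._queue = queue.Queue(maxsize=capacity)
        self._sink = sink  # callable that performs the slow "storage" write
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def put(self, step, payload):
        # The compute side blocks only when the staging buffer is full.
        self._queue.put((step, payload))

    def _drain(self):
        # Runs off the compute path and forwards staged items to storage.
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: no more output will arrive
                break
            self._sink(*item)

    def close(self):
        self._queue.put(None)
        self._worker.join()


def run_simulation(steps, stager):
    state = 0
    for step in range(steps):
        state += step            # stand-in for real computation
        stager.put(step, state)  # hand off output, then keep computing
    return state


stored = []
stager = StagingBuffer(lambda step, payload: stored.append((step, payload)))
final_state = run_simulation(5, stager)
stager.close()
print(final_state, stored)
```

A bounded queue is used so that a slow sink eventually applies back-pressure rather than exhausting memory; how a real staging service schedules its transfers is outside the scope of this sketch.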
To effectively manage large-scale data centers and utility clouds, operators must understand current system and application behaviors. This requires continuous, real-time monitoring along with on-line analysis of the data captured by the monitoring system, i.e., integrated monitoring and analytics, termed Monalytics [28]. A key challenge with such integration is to balance the costs incurred and the associated delays against the benefits of identifying and reacting to undesirable or non-performing system states in a timely fashion. This paper presents a novel, flexible architecture for Monalytics in which such trade-offs are easily made by dynamically constructing software overlays called Distributed Computation Graphs (DCGs) to implement desired analytics functions. The prototype of Monalytics implementing this flexible architecture is evaluated with motivating use cases in small-scale data center experiments, and a series of analytical models is used to understand the above trade-offs at large scales. Results show that the approach provides the flexibility to meet the demands of autonomic management at large scale with considerably better performance/cost than traditional and brute-force solutions.
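As a rough illustration of analytics running inside an overlay rather than at a central collector, the sketch below builds a small tree in which every node combines its own samples with its children's summaries and forwards only the summary upward. The abstract does not specify the DCG structure, so the tree topology, the `GraphNode` class, and the host names are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GraphNode:
    """One vertex of a hypothetical analytics overlay (tree-shaped here)."""

    name: str
    local_samples: List[float] = field(default_factory=list)
    children: List["GraphNode"] = field(default_factory=list)

    def summarize(self) -> Dict[str, float]:
        # Each node merges child summaries with its own samples, so only
        # three numbers travel up each edge instead of every raw sample.
        parts = [child.summarize() for child in self.children]
        count = len(self.local_samples) + sum(p["count"] for p in parts)
        total = sum(self.local_samples) + sum(p["total"] for p in parts)
        peak = max(self.local_samples + [p["peak"] for p in parts], default=0.0)
        return {"count": count, "total": total, "peak": peak}


# Hypothetical CPU-utilization samples on two hosts under one rack.
host_a = GraphNode("host-a", [0.25, 0.5])
host_b = GraphNode("host-b", [0.75])
rack = GraphNode("rack-1", children=[host_a, host_b])
zone = GraphNode("zone", children=[rack])

summary = zone.summarize()
mean_utilization = summary["total"] / summary["count"]
print(summary, mean_utilization)
```

Aggregating in the overlay trades some detail at the root for less data movement, which is one concrete form of the cost versus benefit balance the abstract mentions.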
Summary: Advances in high performance computing, communications and user interfaces enable developers to construct increasingly interactive high performance applications. The Falcon system presented in this paper supports such interactivity by providing runtime libraries, tools and user interfaces that permit the on-line monitoring and steering of large-scale parallel codes. The principal aspects of Falcon described in this paper are its abstractions and tools for capture and analysis of application-specific program information, performed on-line, with controlled latencies and scalable to parallel machines of substantial size. In addition, Falcon provides support for the on-line graphical display of monitoring information, and it allows programs to be steered during their execution, by human users or algorithmically. This paper presents our basic research motivation, outlines the Falcon system's functionality, and includes a detailed evaluation of its performance characteristics in light of its principal contributions. Falcon's functionality and performance evaluation are driven by our experiences with large-scale parallel applications being developed with end users in physics and in atmospheric sciences. The sample application highlighted in this paper is a molecular dynamics simulation program (MD) used by physicists to study the statistical mechanics of liquids.
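The combination of on-line monitoring and algorithmic steering can be sketched as a loop in which each step emits a monitoring sample and a steering callback may change a program parameter before the next step. This is not Falcon's interface: `Sample`, `run_steered`, `damp_when_large`, and the `gain` parameter are hypothetical, and the multiplication stands in for a real simulation step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sample:
    """Application-specific monitoring record (hypothetical layout)."""

    step: int
    value: float


def run_steered(
    steps: int,
    gain: float,
    steer: Callable[[Sample], Optional[float]],
) -> List[Sample]:
    trace: List[Sample] = []
    value = 1.0
    for step in range(steps):
        value *= gain                # stand-in for one simulation step
        sample = Sample(step, value)
        trace.append(sample)         # monitoring: capture the sample on-line
        new_gain = steer(sample)     # steering: a policy may react immediately
        if new_gain is not None:
            gain = new_gain
    return trace


def damp_when_large(sample: Sample) -> Optional[float]:
    # Algorithmic steering rule: halve the growth once the value exceeds 4.
    return 0.5 if sample.value > 4.0 else None


trace = run_steered(5, 2.0, damp_when_large)
values = [s.value for s in trace]
print(values)
```

A human-driven variant would replace `damp_when_large` with a callback that reads user input; the loop itself would not need to change.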