Frank Mueller scite author profile

The determination of upper bounds on execution times, commonly called Worst-Case Execution Times (WCETs), is a necessary step in the development and validation process for hard real-time systems. This problem is hard if the underlying processor architecture has components such as caches, pipelines, branch prediction, and other speculative components. This article describes different approaches to this problem and surveys several commercially available tools and research prototypes.

Proactive fault tolerance for HPC with Xen virtualization

Nagarajan et al. 2007

Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from "unhealthy" nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by, but not limited to, Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications, one that is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.
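The decision step that the abstract above attributes to the proactive FT daemon (health monitoring, load determination, choice of a migration target) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the `Node` record, the temperature reading, the `TEMP_LIMIT_C` threshold and the `plan_migrations` helper are all hypothetical, and the actual Xen live migration call is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    temperature_c: float  # hypothetical health reading
    load: float           # hypothetical load figure (lower means less busy)


TEMP_LIMIT_C = 80.0  # assumed threshold, not taken from the paper


def is_healthy(node: Node) -> bool:
    """A node counts as healthy while its reading stays below the threshold."""
    return node.temperature_c < TEMP_LIMIT_C


def plan_migrations(nodes):
    """Pair each unhealthy node with the least-loaded healthy node.

    Each healthy node is used as a target at most once, which is a
    simplifying assumption of this sketch.
    """
    healthy = sorted((n for n in nodes if is_healthy(n)), key=lambda n: n.load)
    unhealthy = [n for n in nodes if not is_healthy(n)]
    plan = []
    for source in unhealthy:
        if not healthy:
            break  # no spare node left; reactive checkpoint/restart must cover this
        target = healthy.pop(0)
        plan.append((source.name, target.name))
    return plan


cluster = [
    Node("n1", temperature_c=85.0, load=0.9),
    Node("n2", temperature_c=60.0, load=0.5),
    Node("n3", temperature_c=55.0, load=0.1),
]
print(plan_migrations(cluster))
```

In this toy cluster only `n1` exceeds the assumed threshold, so the plan pairs it with `n3`, the least-loaded healthy node.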

ScalaTrace: Scalable compression and replay of communication traces for high-performance computing

Noeth, Ratn, Mueller et al. 2009

Journal of Parallel and Distributed Computing

Characterizing the communication behavior of large-scale applications is a difficult and costly task due to code/system complexity and long execution times. While many tools to study this behavior have been developed, these approaches either aggregate information in a lossy way through high-level statistics or produce huge trace files that are hard to handle. We contribute an approach that provides communication traces that are orders of magnitude smaller, if not near-constant in size, regardless of the number of nodes, while preserving structural information. We introduce intra- and inter-node compression techniques for MPI events that are capable of extracting an application's communication structure. We further present a replay mechanism for the traces generated by our approach and discuss results of our implementation for BlueGene/L. Given this novel capability, we discuss its impact on communication tuning and beyond. To the best of our knowledge, such a concise representation of MPI traces in a scalable manner, combined with deterministic MPI call replay, is without precedent.

Key words: High-Performance Computing, Scalability, Communication Tracing

PACS: 07.05.Bx

An earlier version of this paper appeared at IPDPS'07 [20]. This journal version extends the earlier paper by novel domain-specific intra- and inter-node compression techniques, a completely redesigned inter-node merge algorithm, novel results with a larger class of codes resulting in near-constant trace sizes, a study to identify the timestep loop and extended related work.
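To give a feel for why loop-structured MPI event streams can compress to near-constant size, the sketch below folds consecutive repetitions of a short event pattern into a single (pattern, count) record. It is a generic greedy repeat-folding scheme chosen for illustration, not ScalaTrace's actual intra-node algorithm; the function names `compress` and `expand`, the `max_window` parameter and the sample event stream are assumptions.

```python
def compress(events, max_window=8):
    """Greedily fold consecutive repeats into (pattern, repeat_count) records."""
    out = []
    i = 0
    n = len(events)
    while i < n:
        best_w, best_r = 1, 1
        # Try every window length that could repeat at least twice from position i.
        for w in range(1, min(max_window, (n - i) // 2) + 1):
            pattern = events[i:i + w]
            r = 1
            while events[i + r * w:i + (r + 1) * w] == pattern:
                r += 1
            # Prefer the candidate covering the most events; ties keep the shorter pattern.
            if r > 1 and w * r > best_w * best_r:
                best_w, best_r = w, r
        out.append((tuple(events[i:i + best_w]), best_r))
        i += best_w * best_r
    return out


def expand(compressed):
    """Inverse of compress: replay the original event sequence."""
    events = []
    for pattern, count in compressed:
        events.extend(pattern * count)
    return events


# Hypothetical timestep loop: 1000 iterations of the same three calls, then a reduction.
trace = ["MPI_Isend", "MPI_Irecv", "MPI_Waitall"] * 1000 + ["MPI_Allreduce"]
packed = compress(trace)
print(len(trace), "events ->", len(packed), "records")
```

The number of records here depends on the loop structure rather than the iteration count, which is the property the abstract describes; the lossless `expand` step stands in, very loosely, for replay.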

Timing analysis for data caches and set-associative caches

White, Mueller, Healy et al.

The contributions of this paper are twofold. First, an automatic tool-based approach is described to bound worst-case data cache performance. The given approach works on fully optimized code, performs the analysis over the entire control flow of a program, detects and exploits both spatial and temporal locality within data references, produces results typically within a few seconds, and estimates, on average, 30% tighter WCET bounds than can be predicted without analyzing data cache behavior. Results obtained by running the system on representative programs are presented and indicate that timing analysis of data cache behavior can result in significantly tighter worst-case performance predictions. Second, a framework to bound worst-case instruction cache performance for set-associative caches is formally introduced and operationally described. Results of incorporating instruction cache predictions within pipeline simulation show that timing predictions for set-associative caches remain just as tight as predictions for direct-mapped caches. The cache simulation overhead scales linearly with increasing associativity.
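For readers unfamiliar with the cache organizations named in the abstract above, the following sketch replays a concrete address trace on a set-associative cache and counts hits and misses. It is a plain trace-driven simulation, not the static worst-case analysis the paper describes, and the LRU replacement policy, the 16-byte line size and the `simulate` function are assumptions made for this example.

```python
from collections import OrderedDict


def simulate(addresses, num_sets, ways, line_size=16):
    """Replay an address trace on a set-associative cache with LRU replacement.

    Returns (hits, misses). ways=1 models a direct-mapped cache.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in addresses:
        line = addr // line_size   # memory line that holds this address
        index = line % num_sets    # cache set the line maps to
        tag = line // num_sets     # identifies the line within its set
        cache_set = sets[index]
        if tag in cache_set:
            hits += 1
            cache_set.move_to_end(tag)         # mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict the least recently used line
            cache_set[tag] = True
    return hits, misses


# Two lines that map to the same set, accessed alternately.
conflict_trace = [0, 64, 0, 64]
print(simulate(conflict_trace, num_sets=4, ways=1))  # direct-mapped, 4 lines total
print(simulate(conflict_trace, num_sets=2, ways=2))  # 2-way, same total capacity
```

With the same total capacity, the direct-mapped configuration keeps evicting the two conflicting lines, while the 2-way configuration holds both. Accesses such as `[0, 4, 8, 12]` fall within one line and illustrate spatial locality: one miss followed by hits.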

Bounding pipeline and instruction cache performance

Healy, Arnold, Mueller et al. 1999

IEEE Trans. Comput.

Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing

Fiala, Mueller, Engelmann et al. 2012

Faults have become the norm rather than the exception for high-end computing on clusters with tens to hundreds of thousands of cores. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that corrupt memory while applications continue to operate and report incorrect results. This paper studies the potential for redundancy to both detect and correct soft errors in MPI message-passing applications. Our study investigates the challenges inherent to detecting soft errors within MPI applications while providing transparent MPI redundancy. By assuming a model wherein corruption in application data manifests itself by producing differing MPI message data between replicas, we study the best suited protocols for detecting and correcting MPI data that is the result of corruption. To experimentally validate our proposed detection and correction protocols, we introduce RedMPI, an MPI library which resides in the MPI profiling layer. RedMPI is capable of both online detection and correction of soft errors that occur in MPI applications without requiring any modifications to the application source by utilizing either double or triple redundancy. Our results indicate that our most efficient consistency protocol can successfully protect applications experiencing even high rates of silent data corruption with runtime overheads between 0% and 30% as compared to unprotected applications without redundancy. Using our fault injector within RedMPI, we observe that even a single soft error can have profound effects on running applications, causing a cascading pattern of corruption that in most cases spreads to all other processes. RedMPI's protection has been shown to successfully mitigate the effects of soft errors while allowing applications to complete with correct results even in the face of errors.
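The distinction the abstract above draws between double and triple redundancy can be illustrated with a small comparison routine: two replicas can only reveal that their message data differ, whereas three allow a majority vote to recover the uncorrupted payload. This is a hedged sketch of that general idea, not RedMPI's consistency protocol; `check_replicas` and its status strings are hypothetical names.

```python
from collections import Counter


def check_replicas(payloads):
    """Compare the same logical message as produced by each replica.

    Returns (status, payload):
      'ok'        - all replicas agree
      'corrected' - a strict majority agrees; its payload is returned
      'detected'  - replicas disagree and no majority exists (payload is None)
    """
    counts = Counter(payloads)
    value, votes = counts.most_common(1)[0]
    if votes == len(payloads):
        return "ok", value
    if votes > len(payloads) / 2:
        return "corrected", value
    return "detected", None


print(check_replicas([b"\x01\x02", b"\x01\x02"]))              # double redundancy, clean
print(check_replicas([b"\x01\x02", b"\x01\xff"]))              # double redundancy, mismatch
print(check_replicas([b"\x01\x02", b"\x01\xff", b"\x01\x02"]))  # triple redundancy, outvoted
```

Comparing full payloads keeps the sketch simple; comparing hashes of the payloads would follow the same logic with less data to exchange.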

Communication characteristics of large-scale scientific applications for contemporary cluster architectures

Vetter, Mueller 2003

Journal of Parallel and Distributed Computing

This paper examines the explicit communication characteristics of several sophisticated scientific applications, which, by themselves, constitute a representative suite of publicly available benchmarks for large cluster architectures. By focusing on the Message Passing Interface (MPI) and by using hardware counters on the microprocessor, we observe each application's inherent behavioral characteristics: point-to-point and collective communication, and floating-point operations. Furthermore, we explore the sensitivities of these characteristics to both problem size and number of processors. Our analysis reveals several striking similarities across our diverse set of applications, including the use of collective operations, especially those collectives with very small data payloads. We also highlight a trend of novel applications parting with regimented, static communication patterns in favor of dynamically evolving patterns, as evidenced by our experiments on applications that use implicit linear solvers and adaptive mesh refinement. Overall, our study contributes a better understanding of the requirements of current and emerging paradigms of scientific computing in terms of their computation and communication demands.

Combining Partial Redundancy and Checkpointing for HPC

et al. 2012

The worst-case execution-time problem—overview of methods and survey of tools
