Wide-area parallel processing systems will soon be available to researchers to solve a range of problems. In these systems, it is certain that host failures and other faults will be a common occurrence. Unfortunateb, most parallel processing systems have not been designed with fault-tolerance in mind. Mentat is a high-performance objec t-oriented parallel processing system that is based on an extension of the data-flow model. The functional nature of data-flow enabies both parallelism and fault-tolerance. In this paper, we exploit the data-flow underpinning of Mentat to provide easy-to-use and transparent fault-tolerance. We present results on both a small-scale network and a wide-area heterogeneous environment that consists of three sites: the National Center for Supercomputing Applications, the University of Mrginia and the NASA Langley Research Center.
The Legion project at the University of Virginia is an attempt to provide system services that provide the illusion of a single virtual machine to users, a virtual machine that provides both improved response time via parallel execution and greater throughput. Legion is targeted towards both workstation clusters and towards larger, wide-area, assemblies of workstations, supercomputers, and parallel supercomputers. Rather than construct Legion from scratch we are extending an existing object-oriented parallel processing system by aggressively incorporating lessons learned over twenty years by the heterogeneous distributed systems community. The campus-wide virtual computer is an early Legion prototype. In this paper we present challenges that had to be overcome to realize a working CWVC, as well as performance on a production biochemistry application.
Wide-area parallel processing systems will soon be available to researchers to solve a range of problems. In these systems, it is certain that host failures and other faults will be a common occurrence. Unfortunateb, most parallel processing systems have not been designed with fault-tolerance in mind. Mentat is a high-performance objec t-oriented parallel processing system that is based on an extension of the data-flow model. The functional nature of data-flow enabies both parallelism and faulttolerance. In this paper, we exploit the data-flow underpinning of Mentat to provide easy-to-use and transparent fault-tolerance. We present results on both a small-scale network and a wide-area heterogeneous environment that consists of three sites: the National Center for Supercomputing Applications, the University of Mrginia and the NASA Langley Research Center.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.