The monitoring of distributed systems involves the collection, interpretation, and display of information concerning the interactions among concurrently executing processes. This information and its display can support the debugging, testing, performance evaluation, and dynamic documentation of distributed systems. General problems associated with monitoring are outlined in this paper, and the architecture of a general-purpose, extensible, distributed monitoring system is presented. Three approaches to the display of process interactions are described: textual traces, animated graphical traces, and a combination of aspects of the textual and graphical approaches. The role that each of these approaches fulfills in monitoring and debugging distributed systems is identified and compared. Monitoring tools for collecting communication statistics, detecting deadlock, controlling the non-deterministic execution of distributed systems, and using protocol specifications in monitoring are also described.
Our discussion is based on experience in the development and use of a monitoring system within a distributed programming environment called Jade. Jade was developed within the Computer Science Department of the University of Calgary and is now being used to support teaching and research at a number of university and research organizations.
Time Warp's optimistic scheduling requires the maintenance of simulation state history to support rollback in the event of causality violations. State history, and the ability to roll back the simulation, can provide unique functionality for human-in-the-loop simulation environments. This paper investigates the use of Time Warp to output valid simulation state in a near real-time manner, re-execute portions of the simulation, and interactively probe simulation values to ascertain underlying causes of transient behavior. A shared-memory, multi-threaded interactive simulation architecture is presented and the additional state saving requirements imposed by interactivity are examined. The shortcomings of existing state saving schemes lead us to propose Multiplexed State Saving (MSS). By interleaving checkpointing and incremental state logs, MSS provides bounded rollback costs and asynchronous access to prior simulation state. The interaction algorithms and MSS form a scalable, bounded cost component suitable for use in a real-time interactive Time Warp system.
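The general idea of interleaving full checkpoints with incremental logs can be illustrated with a small sketch. This is not the MSS algorithm from the paper, whose details the abstract does not give. The class name `InterleavedStateSaver`, the `interval` parameter, and the dictionary-based state are all assumptions made for illustration. The sketch takes a full snapshot every `interval` events and records forward deltas in between, so reconstructing any earlier state costs one snapshot copy plus at most `interval - 1` delta applications.

```python
import copy


class InterleavedStateSaver:
    """Hypothetical illustration only: full checkpoints every `interval`
    events, with forward deltas logged for the events in between."""

    def __init__(self, initial_state, interval=4):
        if interval < 1:
            raise ValueError("interval must be at least 1")
        self.interval = interval
        self.state = dict(initial_state)
        # Maps "number of events applied" -> full snapshot of the state.
        self.checkpoints = {0: copy.deepcopy(self.state)}
        # deltas[i] holds the key/value changes made by event i + 1.
        self.deltas = []

    def apply_event(self, changes):
        """Apply one event, log its delta, and checkpoint periodically."""
        self.state.update(changes)
        self.deltas.append(dict(changes))
        count = len(self.deltas)
        if count % self.interval == 0:
            self.checkpoints[count] = copy.deepcopy(self.state)

    def state_at(self, count):
        """Rebuild the state after the first `count` events.

        The live state is not modified, so a prior state can be inspected
        while the simulation continues.
        """
        if not 0 <= count <= len(self.deltas):
            raise IndexError("no history for that event count")
        base = (count // self.interval) * self.interval
        snapshot = copy.deepcopy(self.checkpoints[base])
        # At most interval - 1 deltas lie between a checkpoint and `count`.
        for delta in self.deltas[base:count]:
            snapshot.update(delta)
        return snapshot

    def rollback(self, count):
        """Restore the live state to `count` events and drop later history."""
        self.state = self.state_at(count)
        del self.deltas[count:]
        for key in [k for k in self.checkpoints if k > count]:
            del self.checkpoints[key]
```

For simplicity the deltas here only set keys and never delete them, and the history is never pruned. A Time Warp system would also reclaim history older than global virtual time.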