Software rejuvenation is the concept of gracefully terminating an application and immediately restarting it at a clean internal state. In a client-server type of application where the server is intended to run perpetually to provide a service to its clients, rejuvenating the server process periodically during the most idle time of the server increases the availability of that service. In a long-running computation-intensive application, rejuvenating the application periodically and restarting it at a previous checkpoint increases the likelihood of successfully completing the application execution. We present a model for analyzing software rejuvenation in such continuously-running applications and express downtime and costs due to downtime during rejuvenation in terms of the parameters in that model. Threshold conditions for rejuvenation to be beneficial are also derived. We implemented a reusable module to perform software rejuvenation. That module can be embedded in any existing application on a UNIX platform with minimal effort. Experiences with software rejuvenation in a billing data collection subsystem of a telecommunications operations system and other continuously-running systems and scientific applications in AT&T are described.
Checkpointing with rollback-recovery is a well-known technique to reduce the completion time of a program in the presence of failures. While checkpointing is corrective in nature, rejuvenation refers to preventive maintenance of software aimed to reduce unexpected failures mostly resulting from the "aging" phenomenon. In this paper, we show how both these techniques may be used together to further reduce the expected completion time of a program. The idea of using checkpoints to reduce the amount of rollback upon a failure is taken a step further by combining it with rejuvenation. We derive the equations for expected completion time of a program with finite failure-free running time for the following three cases: (a) neither checkpointing nor rejuvenation is employed, (b) only checkpointing is employed, and finally (c) both checkpointing and rejuvenation are employed. We also present numerical results for Weibull failure time distribution for the above three cases and discuss optimal checkpointing and rejuvenation that minimizes the expected completion time. Using the numerical results, some interesting conclusions are drawn about benefits of these techniques in relation to the nature of failure distribution.
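The three cases compared in the abstract can be explored with a small Monte Carlo sketch. This is not the paper's analytical derivation; it is a simulation under assumptions introduced here for illustration: checkpoints are equally spaced, a failure rolls the program back to the last checkpoint and resets the software's age, rejuvenation is performed right after every checkpoint and also resets the age, and no failure occurs during recovery or rejuvenation. The function names and all parameter values are hypothetical.

```python
import random


def simulate(work, shape, scale, n_ckpt=1, ckpt_cost=0.0,
             recovery_cost=0.0, rejuv_cost=None, rng=None):
    """Return one sampled completion time for a program needing `work`
    units of failure-free running time.

    Assumptions (not taken from the paper): equally spaced checkpoints,
    age reset after a failure or a rejuvenation, and no failures during
    recovery or rejuvenation. rejuv_cost=None disables rejuvenation.
    """
    rng = rng or random.Random()
    segment = work / n_ckpt          # work between consecutive checkpoints
    elapsed = 0.0                    # wall-clock time spent so far
    done = 0                         # segments completed and saved
    age = 0.0                        # time since last restart/rejuvenation
    # random.weibullvariate takes (scale, shape) in that order.
    ttf = rng.weibullvariate(scale, shape)
    while done < n_ckpt:
        # No checkpoint is written after the final segment.
        need = segment + (ckpt_cost if done < n_ckpt - 1 else 0.0)
        if age + need <= ttf:
            # Segment (and its checkpoint) finishes before the next failure.
            elapsed += need
            age += need
            done += 1
            if rejuv_cost is not None and done < n_ckpt:
                # Preventive restart: pay the cost, reset the age.
                elapsed += rejuv_cost
                age = 0.0
                ttf = rng.weibullvariate(scale, shape)
        else:
            # Failure: lose the partial segment, recover, reset the age.
            elapsed += (ttf - age) + recovery_cost
            age = 0.0
            ttf = rng.weibullvariate(scale, shape)
    return elapsed


def mean_completion(trials, seed=0, **kwargs):
    """Average `simulate` over `trials` samples with a fixed seed."""
    rng = random.Random(seed)
    return sum(simulate(rng=rng, **kwargs) for _ in range(trials)) / trials


if __name__ == "__main__":
    # Illustrative parameters only; shape > 1 models an increasing
    # failure rate, i.e. software aging.
    common = dict(work=100.0, shape=2.0, scale=60.0, recovery_cost=1.0)
    print("(a) neither:     ", mean_completion(2000, **common))
    print("(b) checkpoints: ", mean_completion(2000, n_ckpt=10,
                                               ckpt_cost=0.5, **common))
    print("(c) both:        ", mean_completion(2000, n_ckpt=10, ckpt_cost=0.5,
                                               rejuv_cost=0.5, **common))
```

Varying `shape` around 1 is one way to see how the nature of the failure distribution affects whether rejuvenation is worthwhile: with a shape of 1 the failure rate is constant, so resetting the age brings no benefit and the rejuvenation cost is pure overhead in this sketch.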