Because they run for long periods and serve huge numbers of clients, Internet services can easily suffer from transient faults. Although restarting a service can solve this problem, information about the on-line requests is lost during the restart, which is unacceptable for many commercial or transaction-based services. In this paper, we propose an approach to achieve zero-loss restart for Internet services. Under this approach, a kernel subsystem is responsible for detecting transient faults, retaining the I/O channels of the service, and managing the service restart flow. In addition, some straightforward modifications must be made to the service so that it can take advantage of the kernel support. To demonstrate the feasibility of our approach, we implemented the subsystem in the Linux kernel and modified a Web server and a CGI program to use it. According to the experimental results, our approach incurs little runtime overhead (less than 3.2%). When the service crashes, it can be restarted quickly (within 210 µs) with no information loss. Furthermore, the performance impact of a service crash is small. These results show that the approach can efficiently achieve zero-loss restart for Internet services.

KERNEL SUPPORT FOR ZERO-LOSS INTERNET SERVICE RESTART

RELATED WORK

In this section, we describe previous work that was used or can be used for building fault-tolerant Internet service systems.

Checkpointing [11] is one of the best-known approaches to system recovery. It checkpoints the software state into stable storage. When a fault occurs, the system can be recovered from the last checkpointed state. This approach can be applied at different levels, such as the user library level [12,13], compiler level [14-16], operating system level [17,18], and hardware level [19].
Although this approach can recover a system from transient faults, it is not suitable for service applications that contain hard-to-detect bugs, which cause them to crash after a long period of execution. This is because the recovered state is aged rather than fresh, and thus the service may crash again immediately after recovery. In this situation, restarting a fresh copy of the service is a more suitable approach. In addition, many checkpointing techniques incur large overheads owing to the large amount of checkpointed state and the access to stable storage. Since our approach does not address server node failures but only service faults, we use memory for state storage. Moreover, we can reduce the amount of state that needs to be saved through application-kernel cooperation.

The concept of developing recovery-oriented software for dealing with errors was proposed by the Recovery-Oriented Computing (ROC) project [20], which is a joint effort of the University of California, Berkeley and Stanford University. Unlike previous research, which usually addressed the Mean Time to Failure (MTTF), ROC offered high availability by reducing the Mean...
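
The idea discussed in this section, restarting a fresh copy of the service while the information about on-line requests is kept in memory outside the service, can be illustrated with a small user-space toy. This is only a sketch under stated assumptions and not the kernel subsystem described in the paper: `RequestStore`, `Service`, `run_with_restart`, and the `crash_after` parameter are hypothetical names introduced for illustration, and the fault is simulated with an exception.

```python
from dataclasses import dataclass, field


class TransientFault(Exception):
    """Simulated crash of the service (hypothetical, for illustration only)."""


@dataclass
class RequestStore:
    """In-memory request state kept outside the service.

    In the paper this role is played by the kernel; here it is a plain object
    that simply outlives any single Service instance.
    """
    pending: dict = field(default_factory=dict)
    done: dict = field(default_factory=dict)


class Service:
    """Toy service whose internal state is thrown away on restart."""

    def __init__(self, crash_after=None):
        self.handled = 0                # aged internal state, lost on restart
        self.crash_after = crash_after  # None means this instance never crashes

    def handle(self, payload):
        if self.crash_after is not None and self.handled == self.crash_after:
            raise TransientFault("simulated transient fault")
        self.handled += 1
        return payload.upper()


def run_with_restart(requests, crash_after=2):
    """Serve all requests, restarting a fresh Service whenever it crashes."""
    store = RequestStore(pending=dict(enumerate(requests)))
    service = Service(crash_after)
    restarts = 0
    while store.pending:
        req_id = min(store.pending)
        try:
            result = service.handle(store.pending[req_id])
        except TransientFault:
            # The request is still in store.pending, so nothing is lost.
            # Start a fresh copy instead of restoring the aged state.
            restarts += 1
            service = Service(crash_after=None)
            continue
        store.done[req_id] = result
        del store.pending[req_id]
    return store.done, restarts


if __name__ == "__main__":
    done, restarts = run_with_restart(["a", "b", "c", "d"])
    print(done, restarts)
```

The design choice mirrored here is that only the request information survives the crash; the service's own internal state (`handled`) is deliberately not restored, which is the difference from checkpoint-based recovery described above.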