Checkpoint-recovery based Virtual Machine (VM) replication is an emerging approach to providing high availability for VM installations. However, it comes at the price of significant performance degradation of the application executed in the VM, due to the large amount of state that needs to be synchronized between the primary and the backup machines. It is therefore critical to find new ways of attaining good performance while maintaining fault-tolerant execution. In this paper, we present a novel approach to improving the performance of services deployed over replicated virtual machines by exploiting data similarity within the VM's memory image to reduce the network traffic during synchronization. To identify similar memory areas, we propose a bit-density-based hash function, upon which we build a content-addressable hash table. We present a quantitative analysis of the degree of similarity we found in various workloads, and introduce a lightweight compression method which, compared to existing replication techniques, reduces network traffic by up to 80% and yields a performance improvement of over 90% for certain latency-sensitive applications.
I. INTRODUCTION

With the recent increase in cloud computing's prevalence, the number of online services deployed over virtualized infrastructures has grown tremendously. At the same time, however, the hardware trend toward growing component counts in current computing systems renders hardware failures commonplace rather than exceptional [1]. Replication at the Virtual Machine Monitor (VMM) layer is an attractive technique to ensure fault tolerance in such environments, primarily because it provides seamless failover for the entire software stack executed inside the Virtual Machine (VM), regardless of the application or the underlying operating system. One particular approach, checkpoint-recovery based VM replication, has gained a lot of attention recently [2], [3], [4], [5].

Checkpoint-recovery based replication of virtual machines is attained by capturing the entire execution state of the running VM at a relatively high frequency in order to propagate changes to the backup machine almost instantly. Essentially, it keeps the backup machine nearly up-to-date with the latest execution state of the primary machine so that the backup can take over the execution in case the primary fails [2].

Between checkpoints, the VM executes in log-dirty mode, i.e., write-accessed pages are recorded so that when the snapshot is taken only pages that were modified in the most