Given a set of files that show a certain degree of similarity, we consider a novel problem of deduplicating them (eliminating redundant chunks) across a set of distributed servers in a manner that is: (i) space-efficient: the total space needed to deduplicate and store the files is minimized and, (ii) access-efficient: each file can be accessed by communicating with a bounded number of servers, thereby minimizing networkaccess times in congested data center networks. A spaceoptimal solution in which we first deduplicate all the files and then distribute them across the servers (referred to as chunkdistribution), may require communication with many servers to access each file. On the other hand, an access-efficient solution in which we randomly partition the files cross the servers, and then store their unique chunks on each server may not exploit the similarities across files to reduce the space overhead. In this paper, we first show that finding an access-efficient, space optimal solution is an NP-Hard problem. Following this, we present the similarity-aware-partitioning (SAP) algorithms that find access-efficient solutions within polynomial time complexity and guarantees bounded space overhead for arbitrary files. Our experimental verification on files from Dropbox and CNN confirm that the SAP technique is much more space-efficient than random partitioning, while maintaining compression ratio close to the chunk-distribution solution.
Abstract-The paper describes a technique to tolerate faults in large data structures hosted on distributed servers, based on the concept of fused backups. The prevalent solution to this problem is replication. To tolerate f crash faults (dead/unresponsive data structures) among n distinct data structures, replication requires f + 1 replicas of each data structure, resulting in nf additional backups. We present a solution, referred to as fusion that uses a combination of erasure codes and selective replication to tolerate f crash faults using just f additional fused backups. We show that our solution achieves O(n) savings in space over replication. Further, we present a solution to tolerate f Byzantine faults (malicious data structures), that requires only nf + f backups as compared to the 2nf backups required by replication. We ensure that the overhead for normal operation in fusion is only as much as the overhead for replication. Though recovery is costly in fusion, in a system with infrequent faults, the savings in space outweighs the cost of recovery. We explore the theory of fused backups and provide a library of such backups for all the data structures in the Java Collection Framework. Our experimental evaluation confirms that fused backups are space-efficient as compared to replication (approximately n times), while they cause very little overhead for updates. To illustrate the practical usefulness of fusion, we use fused backups for reliability in Amazon's highly available key-value store, Dynamo. While the current replicationbased solution uses 300 backup structures, we present a solution that only requires 120 backup structures. This results in savings in space as well as other resources such as power.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.