Distributed storage systems provide reliable access to data through redundancy spread over individually unreliable nodes. Application scenarios include data centers, peer-to-peer storage systems, and storage in wireless networks. Storing data using an erasure code, in fragments spread across nodes, requires less redundancy than simple replication for the same level of reliability. However, since fragments must be periodically replaced as nodes fail, a key question is how to generate encoded fragments in a distributed way while transferring as little data as possible across the network. For an erasure-coded system, a common practice to repair from a node failure is for a new node to download subsets of data stored at a number of surviving nodes, reconstruct a lost coded block using the downloaded data, and store it at the new node. We show that this procedure is suboptimal. We introduce the notion of regenerating codes, which allow a new node to download functions of the stored data from the surviving nodes. We show that regenerating codes can significantly reduce the repair bandwidth. Further, we show that there is a fundamental tradeoff between storage and repair bandwidth, which we characterize theoretically using flow arguments on an appropriately constructed graph. By invoking constructive results in network coding, we introduce regenerating codes that can achieve any point on this optimal tradeoff.
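As a concrete illustration of the storage/repair-bandwidth tradeoff described above, the following sketch restates the cut-set bound commonly derived via the flow arguments mentioned in the abstract, for a file of size M stored so that any k of n nodes suffice for recovery, with a newcomer downloading β symbols from each of d helpers (repair bandwidth γ = dβ); the symbol names are standard in this literature but are introduced here as assumptions:

```latex
% Cut-set bound: a file of size M is recoverable from any k nodes
% (each storing \alpha) only if
\[
  M \;\le\; \sum_{i=0}^{k-1} \min\{\alpha,\,(d-i)\beta\}.
\]
% Two extreme points of the resulting tradeoff, with \gamma = d\beta:
% minimum-storage (MSR) and minimum-bandwidth (MBR) regeneration.
\[
  (\alpha_{\mathrm{MSR}}, \gamma_{\mathrm{MSR}})
    = \Bigl(\tfrac{M}{k},\; \tfrac{Md}{k(d-k+1)}\Bigr),
  \qquad
  (\alpha_{\mathrm{MBR}}, \gamma_{\mathrm{MBR}})
    = \Bigl(\tfrac{2Md}{k(2d-k+1)},\; \tfrac{2Md}{k(2d-k+1)}\Bigr).
\]
```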
Codes are widely used in many engineering applications to offer robustness against noise. In large-scale systems, several types of "noise" can affect the performance of distributed machine learning algorithms (straggler nodes, system failures, or communication bottlenecks), but there has been little interaction cutting across codes, machine learning, and distributed systems. In this work, we provide theoretical insights on how coded solutions can achieve significant gains compared to uncoded ones. We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling. For matrix multiplication, we use codes to alleviate the effect of stragglers and show that if the number of homogeneous workers is n and the runtime of each subtask has an exponential tail, coded computation can speed up distributed matrix multiplication by a factor of log n. For data shuffling, we use codes to reduce communication bottlenecks, exploiting the excess in storage. We show that when a constant fraction α of the data matrix can be cached at each worker, and n is the number of workers, coded shuffling reduces the communication cost by a factor of (α + 1/n)γ(n) compared to uncoded shuffling, where γ(n) is the ratio of the cost of unicasting n messages to n users to that of multicasting a common message (of the same size) to n users. For instance, γ(n) ≈ n if multicasting a message to n users is as cheap as unicasting a message to one user. We also provide experimental results corroborating the theoretical gains of our coded algorithms.
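A minimal sketch of the coded-computation idea for straggler mitigation described above, using a random linear (MDS-like, with high probability) code over the reals; splitting A into k row blocks and the helper names are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def encode_rows(A, n, k, rng):
    """Split A into k row blocks and form n coded blocks as random
    linear combinations (full-rank in any k rows w.h.p.)."""
    blocks = np.split(A, k)               # k equal row blocks of A
    G = rng.standard_normal((n, k))       # random generator matrix
    coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]
    return coded, G

def decode_from_any_k(partial, G, done):
    """Recover A @ x from the first k finished workers by inverting
    the corresponding k x k submatrix of the generator."""
    Gk = G[done, :]                               # rows for finished workers
    y = np.stack([partial[i] for i in done])      # their coded results
    return np.linalg.solve(Gk, y).ravel()         # block results of A @ x

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
x = rng.standard_normal(4)
n, k = 5, 3                               # 5 workers, any 3 suffice
coded, G = encode_rows(A, n, k, rng)
partial = {i: coded[i] @ x for i in range(n)}   # each worker's block product
fastest = [0, 2, 4]                       # pretend workers 1 and 3 straggle
est = decode_from_any_k(partial, G, fastest)
assert np.allclose(est, A @ x)            # exact recovery from any k results
```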
In distributed storage systems where reliability is maintained using erasure coding, network codes can be designed to meet specific requirements.

By Alexandros G. Dimakis, Member IEEE; Kannan Ramchandran, Fellow IEEE; Yunnan Wu, Member IEEE; and Changho Suh, Student Member IEEE

ABSTRACT | Distributed storage systems often introduce redundancy to increase reliability. When coding is used, the repair problem arises: if a node storing encoded information fails, then in order to maintain the same level of reliability we need to create encoded information at a new node. This amounts to a partial recovery of the code, whereas conventional erasure coding focuses on the complete recovery of the information from a subset of encoded packets. Consideration of the repair network traffic gives rise to new design challenges. Recently, network coding techniques have been instrumental in addressing these challenges, establishing that maintenance bandwidth can be reduced by orders of magnitude compared to standard erasure codes. This paper provides an overview of the research results on this topic.
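To make the bandwidth savings concrete, here is a minimal numerical sketch comparing conventional repair (downloading the entire file) against the minimum-storage regenerating (MSR) point γ = Md/(k(d−k+1)); the file size and (n, k, d) values are illustrative assumptions:

```python
# Repair traffic: conventional erasure codes vs. regenerating codes (MSR point).
M = 16.0        # file size in GB (illustrative)
n, k = 14, 7    # any k of the n nodes can recover the file
d = n - 1       # the newcomer contacts all 13 survivors

naive = M                            # classic repair: rebuild the whole file
msr = M * d / (k * (d - k + 1))      # MSR bound: 16*13/(7*7) ~ 4.24 GB

print(f"naive repair: {naive:.2f} GB")
print(f"MSR repair:   {msr:.2f} GB  ({naive/msr:.1f}x less traffic)")
```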
Fig. 1: Conceptual diagram of the phases of distributed computation. The algorithmic workflow of distributed (potentially iterative) tasks can be seen as receiving input data, storing it at distributed nodes, communicating data around the distributed network, and then locally computing a function at each node. The main bottlenecks in this execution (communication, stragglers, system failures) can all be abstracted away by incorporating a notion of delays between these phases, denoted by ∆ boxes.

A major innovation of these coding solutions is that they are woven into the fabric of the algorithmic design, and coding/decoding is performed over the representation field of the input data (e.g., floats or doubles), in sharp contrast to most coding approaches.
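The index-coding intuition behind coded shuffling can be seen in a toy Python sketch (entirely illustrative: two workers with opposite caches, and a single XOR multicast replacing two unicasts):

```python
import numpy as np

# Toy coded-shuffling round: worker 1 caches block A and needs B;
# worker 2 caches block B and needs A. Instead of unicasting A and B
# separately, the master multicasts one XOR-coded packet.
rng = np.random.default_rng(1)
A = rng.integers(0, 256, size=8, dtype=np.uint8)   # data blocks as raw bytes
B = rng.integers(0, 256, size=8, dtype=np.uint8)

coded = A ^ B                 # one multicast packet instead of two unicasts

B_at_worker1 = coded ^ A      # worker 1 decodes using its cached A
A_at_worker2 = coded ^ B      # worker 2 decodes using its cached B

assert np.array_equal(B_at_worker1, B) and np.array_equal(A_at_worker2, A)
```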
A fast rate-distortion (R-D) optimal scheme for coding adaptive trees whose individual nodes spawn descendants forming a disjoint and complete basis cover for the space spanned by their parent nodes is presented. The scheme guarantees operation on the convex hull of the operational R-D curve and uses a fast dynamic-programming pruning algorithm to markedly reduce computational complexity. Applications for this coding technique include R. Coifman et al.'s (Yale Univ., 1990) generalized multiresolution wavelet packet decomposition, iterative subband coders, and quadtree structures. Applications to image processing involving wavelet packets as well as discrete cosine transform (DCT) quadtrees are presented.
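A hedged sketch of the kind of bottom-up Lagrangian tree pruning the abstract describes: keep a node's split only if the children's total cost J = D + λR beats coding the node as a leaf. The node structure and cost values here are illustrative assumptions, not the paper's exact algorithm:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    rate: float                      # bits to code this node as a leaf
    dist: float                      # distortion if coded as a leaf
    left: Optional["Node"] = None    # children (both present, or neither)
    right: Optional["Node"] = None

def prune(node: Node, lam: float) -> float:
    """Bottom-up pruning: compare the leaf's Lagrangian cost against the
    children's total; prune the subtree when the leaf is cheaper.
    Returns the best cost of the (possibly pruned) subtree."""
    leaf_cost = node.dist + lam * node.rate
    if node.left is None:            # already a leaf
        return leaf_cost
    split_cost = prune(node.left, lam) + prune(node.right, lam)
    if leaf_cost <= split_cost:      # parent wins: prune the subtree
        node.left = node.right = None
        return leaf_cost
    return split_cost

# Example: at lambda = 1, splitting the root costs 18 + 20 = 38 < 60,
# so the split survives pruning.
root = Node(rate=10, dist=50,
            left=Node(rate=6, dist=12), right=Node(rate=6, dist=14))
best = prune(root, lam=1.0)
```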
In this paper we provide an overview of rate-distortion (R-D) based optimization techniques and their practical application to image and video coding. We begin with a short discussion of classical rate-distortion theory and then show how, in many practical coding scenarios such as standards-compliant coding environments, resource allocation can be put in an R-D framework. We then introduce two popular techniques for resource allocation, namely Lagrangian optimization and dynamic programming. After a discussion of these two techniques as well as some of their extensions, we conclude with a quick review of recent literature in these areas, citing a number of applications related to image and video compression and transmission. We provide a number of illustrative boxes to capture the salient points in our paper.
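A small, hedged Python sketch of Lagrangian bit allocation of the kind surveyed here: each block independently picks the operating point minimizing D + λR, and λ is adjusted (by bisection, an illustrative choice) until the total rate meets the budget:

```python
def allocate(blocks, lam):
    """For each block (a list of (rate, distortion) operating points),
    pick the point minimizing the Lagrangian cost D + lam * R."""
    choices = [min(pts, key=lambda p: p[1] + lam * p[0]) for pts in blocks]
    total_rate = sum(r for r, _ in choices)
    return choices, total_rate

def fit_budget(blocks, budget, lo=0.0, hi=1e6, iters=60):
    """Bisect on lambda until the rate budget is met (a convex-hull
    solution: points off the hull are never selected)."""
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        _, rate = allocate(blocks, lam)
        if rate > budget:
            lo = lam          # too many bits: penalize rate more
        else:
            hi = lam
    return allocate(blocks, hi)

# Two blocks, each with three (rate, distortion) operating points.
blocks = [[(1, 9.0), (2, 4.0), (4, 1.0)],
          [(1, 6.0), (2, 3.5), (4, 0.5)]]
choices, rate = fit_budget(blocks, budget=5)
```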
We address the problem of distributed source coding, i.e., compression of correlated sources that are not co-located and/or cannot communicate with each other to minimize their joint description cost. In this work we tackle the related problem of compressing a source that is correlated with another source which is, however, available only at the decoder. In contrast to prior information-theoretic approaches, we introduce a new constructive and practical framework for tackling the problem based on the judicious incorporation of channel coding principles into this source coding problem. We dub our approach DIstributed Source Coding Using Syndromes (DISCUS). In this paper we focus on trellis-structured constructions of the framework to illustrate its utility. Simulation results confirm the power of DISCUS, opening up a new and exciting constructive playing-ground for the distributed source coding problem. For the distributed coding of correlated i.i.d. Gaussian sources that are noisy versions of each other with "correlation-SNR" in the range of 12 to 20 dB, the DISCUS method attains gains of 7-15 dB in SNR over the Shannon bound for "naive" independent coding of the sources.
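The syndrome idea can be illustrated with the classic 3-bit toy example (an illustrative sketch, not the paper's trellis construction): X and Y are 3-bit words differing in at most one position; the encoder sends only the 2-bit syndrome of X with respect to the (3,1) repetition code, and the decoder recovers X from the syndrome plus Y:

```python
import numpy as np

# (3,1) repetition code {000, 111}; H is its parity-check matrix.
H = np.array([[1, 0, 1],
              [0, 1, 1]])

def syndrome(x):
    """2-bit syndrome s = H x (mod 2): indexes the coset containing x."""
    return tuple(H @ x % 2)

def decode(s, y):
    """Side-information decoding: among the 2 words in coset s,
    return the one closest to y in Hamming distance."""
    coset = [x for x in np.ndindex(2, 2, 2)
             if syndrome(np.array(x)) == s]
    return min(coset, key=lambda x: int(np.sum(np.array(x) != y)))

x = np.array([1, 0, 1])   # source seen at the encoder
y = np.array([1, 1, 1])   # decoder side information: differs in <= 1 bit
s = syndrome(x)           # only 2 bits cross the channel, not 3
assert decode(s, y) == tuple(x)
```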
We introduce a new class of exact Minimum-Bandwidth Regenerating (MBR) codes for distributed storage systems, characterized by a low-complexity uncoded repair process that can tolerate multiple node failures. These codes consist of the concatenation of two components: an outer MDS code followed by an inner repetition code. We refer to the inner code as a Fractional Repetition code, since it splits the data of each node into several packets and stores multiple replicas of each on different nodes in the system. Our model for repair is table-based and thus differs from the random-access model adopted in the literature. We present constructions of Fractional Repetition codes based on regular graphs and Steiner systems for a large set of system parameters. The resulting codes are guaranteed to achieve the storage capacity for random-access repair. The considered model motivates a new definition of capacity for distributed storage systems, which we call Fractional Repetition capacity. We provide upper bounds on this capacity, while a precise expression remains an open problem.
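A hedged sketch of the graph-based placement idea: in a complete-graph construction, each edge corresponds to one coded packet stored on its two endpoint nodes, giving repetition degree ρ = 2 and per-node storage d = n − 1. The packet indexing below is an illustrative assumption:

```python
from itertools import combinations

def fractional_repetition_K(n):
    """Fractional Repetition placement from the complete graph K_n:
    packet {i, j} is stored on nodes i and j, so every packet is
    replicated twice and every node stores n - 1 packets."""
    placement = {node: [] for node in range(n)}
    for pid, (i, j) in enumerate(combinations(range(n), 2)):
        placement[i].append(pid)
        placement[j].append(pid)
    return placement

placement = fractional_repetition_K(5)      # K_5: 10 packets over 5 nodes
assert all(len(p) == 4 for p in placement.values())

# Uncoded repair: every packet lost with node 0 has an exact replica
# elsewhere, so survivors simply transfer their copies (no decoding).
lost = set(placement[0])
survivors = {pid for node in range(1, 5) for pid in placement[node]}
assert lost <= survivors
```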