We introduce a dynamical model of node repair in distributed storage systems wherein the storage nodes are subjected to failures according to independent Poisson processes. The main parameter that we study is the time-average capacity of the network in the scenario where a fixed subset of the nodes support a higher repair bandwidth than the other nodes. The sequence of node failures generates random permutations of the nodes in the encoded block, and we model the state of the network as a Markov random walk on permutations of n elements. As our main result we show that the capacity of the network can be increased compared to the static (worst-case) model of the storage system, while maintaining the same (average) repair bandwidth, and we derive estimates of the increase. We also quantify the capacity increase in the case that the repair center has information about the sequence of the recently failed storage nodes.
I. Introduction

The problem of node repair based on erasure coding for distributed storage aims at optimizing the tradeoff between network traffic and storage overhead. In this form, the problem was formulated in [8] from the perspective of network coding. This model has been generalized in various ways, including concurrent failure of several nodes [6], heterogeneous architectures [2], [17], cooperative repair [13], and others. The existing body of work focuses on the failure of a node (or several nodes) and the ensuing reconstruction process, but puts less emphasis on the time evolution of the entire network and the inherent stochastic nature of node failures. The static view of the system and of node repair leads to schemes based on the worst-case scenario, in the sense that the amount of data to be stored is known in advance, the amount of data each node transmits is known, and the repair capacity is determined by the least advantageous state of the network. Switching to evolving networks makes it possible to define and study the average amount of data moved through the network to accomplish repair, and may give a slightly more comprehensive view of the system.

Several models of storage systems have been considered in the literature. The basic model of [8] assumes that the amount of data that each node transmits to the repair center is fixed. The analysis of the network traffic and storage overhead relies on [1], which quantifies the maximum total amount of data (or flow) that can arrive at a specific point, but does not specify the exact amount of data that each node should transmit at each time instant.
To use the communication bandwidth more efficiently, we assume the amount of data that each node transmits changes over time, while the total amount of communicated information averaged over multiple repair cycles is fixed. A similar idea appears, although not explicitly, in [16], where the authors propose to perform repair of several failed nodes within one repair cycle with the purpose of decreasing the network traffic. The decrease can occur if the information sent over a particular link can be used for repai...