Latency Analysis for Distributed Coded Storage Systems

Badita, Ajay; Parag, Parimal; Chamberland, Jean-François

doi:10.1109/tit.2019.2909868

Cited by 24 publications

(41 citation statements)

References 31 publications


“…As discussed earlier, we model the query processing system as an (n, k) fork-join system where each query is forked to all n servers and is considered to be completed once k out of n responses are obtained. Analytically, (n, k) fork-join queues for homogeneous servers with single rates have been studied previously in the literature [10]-[12], [15], [31]-[36]. Even for the simple case of servers operating at a single service rate with i.i.d.…”

Section: Related Work

mentioning

confidence: 99%
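The (n, k) completion rule quoted above can be illustrated with a minimal sketch. It assumes a single isolated query served by n idle servers with i.i.d. exponential service times at rate mu, so it ignores queueing entirely; the citing papers analyze the much harder queueing case. Under that assumption the completion time is the k-th smallest of n exponential samples, with mean (1/mu) · Σ_{i=0}^{k-1} 1/(n − i). The function names and parameters are hypothetical and do not come from any of the cited works.

```python
import random


def kth_completion_mean(n, k, mu):
    """Closed-form mean of the k-th smallest of n i.i.d. Exp(mu) service times.

    Assumption: one isolated query on n idle servers (no queueing).
    """
    return sum(1.0 / (mu * (n - i)) for i in range(k))


def simulate_kth_completion(n, k, mu, trials=100_000, seed=0):
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Fork: draw one service time per server.
        times = sorted(rng.expovariate(mu) for _ in range(n))
        # Join: the query completes at the k-th response.
        total += times[k - 1]
    return total / trials


if __name__ == "__main__":
    n, k, mu = 3, 2, 1.0
    print("closed form:", kth_completion_mean(n, k, mu))
    print("simulated:  ", simulate_kth_completion(n, k, mu))
```

For n = 3, k = 2 and mu = 1 the closed form is 1/3 + 1/2 = 5/6, and the simulated value should land close to it.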

“…The problem remains open for systems with larger number of servers. For exponentially distributed service times and Poisson arrivals, bounds are presented in [11], [12], [31]-[33], [38], [39], and analytical approximations in [15]. Exact analysis for special case of large systems is considered in [35], [40], and small systems in [34].…”

Section: Related Work

mentioning

confidence: 99%

“…We first observe that the Markov process under consideration can be thought of as a sequence of virtual queues in series, which pool their servers when idling. For a single process power state, this tandem queue has been studied in [14], [15], where the system state is a sequence of the number of applications that have been served by i k servers. Due to multiple processor power states and probabilistic slowdown, the resulting Markov process for our model is more complex.…”

Section: Introduction

mentioning

confidence: 99%


Modeling Performance and Energy trade-offs in Online Data-Intensive Applications

Badita, Jinan, Vamanan et al. 2021

Preprint

Self Cite


We consider energy minimization for data-intensive applications run on a large number of servers, for given performance guarantees. We consider a system where each incoming application is sent to a set of servers, and is considered to be completed if a subset of them finish serving it. We consider a simple case when each server core has two speed levels, where the higher speed can be achieved by higher power for each core independently. The core selects one of the two speeds probabilistically for each incoming application request. We model arrival of application requests by a Poisson process, and random service time at the server with independent exponential random variables. Our model and analysis generalize to today's state-of-the-art in CPU energy management where each core can independently select a speed level from a set of supported speeds and corresponding voltages. The performance metrics under consideration are the mean number of applications in the system and the average energy expenditure. We first provide a tight approximation to study this previously intractable problem and derive closed-form approximate expressions for the performance metrics when service times are exponentially distributed. Next, we study the trade-off between the approximate mean number of applications and energy expenditure in terms of the switching probability. We demonstrate that the numerically obtained curves are closely approximated by the expressions derived for the performance metrics.


“…This implies that each job is forked to the identical number of servers, and job is completed by joining identical number of service completions. Tight numerical bounds are provided in [6], analytical bounds are presented in [7], [15]-[17], analytical approximations appear in [18], exact analysis for small systems in [19], exact analysis for random independent scheduling for asymptotically large number of servers in [20], and an exact analysis of tail index for Pareto-distributed file sizes in [21].…”

Section: A Related Work

mentioning

confidence: 99%

Optimal Server Selection for Straggler Mitigation

Badita, Parag, Aggarwal 2020

IEEE/ACM Trans. Networking

Self Cite


The performance of large-scale distributed compute systems is adversely impacted by stragglers when the execution time of a job is uncertain. To manage stragglers, we consider a multi-fork approach for job scheduling, where additional parallel servers are added at forking instants. In terms of the forking instants and the number of additional servers, we compute the job completion time and the cost of server utilization when the task processing times are assumed to have a shifted exponential distribution. We use this study to provide insights into the scheduling design of the forking instants and the associated number of additional servers to be started. Numerical results demonstrate orders of magnitude improvement in cost in the regime of low completion times as compared to the prior works.


“…Coding for minimizing latency has been considered on its own in a separate line of works starting with [12]. We refer to [4] for an overview of the literature where access latency is considered in the framework of queueing theory.…”

mentioning

confidence: 99%

Capacity of dynamical storage systems

Elishco, Barg 2019

2019 IEEE International Symposium on Information Theory (ISIT)


We introduce a dynamical model of node repair in distributed storage systems wherein the storage nodes are subjected to failures according to independent Poisson processes. The main parameter that we study is the time-average capacity of the network in the scenario where a fixed subset of the nodes support a higher repair bandwidth than the other nodes. The sequence of node failures generates random permutations of the nodes in the encoded block, and we model the state of the network as a Markov random walk on permutations of n elements. As our main result we show that the capacity of the network can be increased compared to the static (worst-case) model of the storage system, while maintaining the same (average) repair bandwidth, and we derive estimates of the increase. We also quantify the capacity increase in the case that the repair center has information about the sequence of the recently failed storage nodes.

I. Introduction

The problem of node repair based on erasure coding for distributed storage aims at optimizing the tradeoff of network traffic and storage overhead. In this form it was established by [8] from the perspective of network coding. This model was generalized in various ways such as concurrent failure of several nodes [6], heterogeneous architecture [2], [17], cooperative repair [13], and others. The existing body of works focuses on the failure of a node (or several nodes) and the ensuing reconstruction process, but puts less emphasis on the time evolution of the entire network and the inherent stochastic nature of the node failures. The static point of view of the system and of node repair leads to schemes based on the worst case scenario in the sense that the amount of data to be stored is known in advance, the amount of data each node transmits is known, and the repair capacity is determined by the least advantageous state of the network.

Switching to evolving networks makes it possible to define and study the average amount of data moved through the network to accomplish repair, and may give slightly more comprehensive view of the system. Several models of storage systems have been considered in the literature. The basic model of [8] assumes that the amount of data that each node transmits to the repair center is fixed. The analysis of the network traffic and storage overhead relies on [1] which quantifies the maximum total amount of data (or flow) that can arrive at a specific point, but does not specify the exact amount of data that each node should transmit at each time instant. To use the communication bandwidth more efficiently, we assume the amount of data that each node transmits changes over time, while the total amount of communicated information averaged over multiple repair cycles is fixed.

A similar idea appears, although not explicitly, in [16], where the authors propose to perform repair of several failed nodes within one repair cycle with the purpose of decreasing the network traffic. The decrease can occur if the information sent over a particular link can be used for repai...
