Abstract. As parallel systems continue to grow in scale, many of them employing hundreds of computing engines in mission-critical roles, it is crucial to design those systems to anticipate and accommodate failures. Failures become a commonplace feature of such large-scale systems, and one can no longer treat them as exceptions. Despite their current and increasing importance, our understanding of the performance impact of failures on parallel computing environments is extremely limited. In this paper we develop a general failure modeling framework based on recent results from large-scale clusters, and we exploit this framework to conduct a detailed analysis of the impact of failures on system performance for a wide range of scheduling policies. Our results demonstrate that such failures can significantly degrade the mean job response time and mean job slowdown under existing scheduling policies that ignore failures. We therefore investigate different scheduling mechanisms and policies to address these performance issues. Our results show that periodic checkpointing of jobs does little to ease the problem. On the other hand, we demonstrate that information about the spatial and temporal correlation of failure occurrences can be very useful in designing a scheduling (job allocation) strategy that enhances system performance, with spatial correlation providing the greatest benefits.
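The kind of temporal-correlation-aware allocation strategy described above can be sketched with a toy heuristic: if failures cluster in time on a given node, a scheduler can prefer nodes whose most recent failure lies furthest in the past. The function name, the data layout, and the scoring rule below are all illustrative assumptions, not the paper's actual policy.

```python
def failure_aware_allocation(node_last_failure, now, k):
    """Toy node-selection heuristic exploiting temporal correlation of
    failures: a node that failed recently is assumed more likely to fail
    again soon, so we pick the k nodes whose last recorded failure is
    furthest in the past (nodes with no recorded failure rank first).

    node_last_failure maps node id -> time of most recent failure, or
    None if the node has never failed. Illustrative sketch only."""
    def time_since_failure(node):
        last = node_last_failure[node]
        return float('inf') if last is None else now - last

    # Rank nodes by elapsed time since their last failure, longest first.
    ranked = sorted(node_last_failure, key=time_since_failure, reverse=True)
    return ranked[:k]
```

A failure-oblivious scheduler would instead pick nodes by load alone; the point of the sketch is only that failure history can enter the allocation decision at negligible cost.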
Lost sales inventory models with large lead times, which arise in many practical settings, are notoriously difficult to optimize due to the curse of dimensionality. In this paper we show that when lead times are large, a very simple constant-order policy, first studied by Reiman [39], performs nearly optimally. The main insight of our work is that when the lead time is very large, such a significant amount of randomness is injected into the system between when an order for more inventory is placed and when the order is received, that "being smart" algorithmically provides almost no benefit. Our main proof technique combines a novel coupling for suprema of random walks with arguments from queueing theory.
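The constant-order policy itself is simple to state: every period, order the same fixed quantity q regardless of the current inventory state. The following minimal simulation sketch illustrates the dynamics (orders in transit for the full lead time, excess demand lost rather than backlogged); the uniform demand distribution and starting stock are assumptions made purely for illustration, not part of the paper's analysis.

```python
import random

def simulate_constant_order(q, lead_time, mean_demand, periods, seed=0):
    """Toy simulation of a lost-sales inventory system under the
    constant-order policy: every period, order exactly q units, which
    arrive lead_time periods later; demand exceeding on-hand stock is
    lost, not backlogged. Demand is uniform on {0, ..., 2*mean_demand}
    for illustration. Returns the fraction of demand that was lost."""
    rng = random.Random(seed)
    on_hand = q * lead_time           # arbitrary illustrative starting stock
    pipeline = [q] * lead_time        # orders already in transit
    total_demand = lost = 0
    for _ in range(periods):
        on_hand += pipeline.pop(0)    # oldest outstanding order arrives
        pipeline.append(q)            # place this period's constant order
        demand = rng.randint(0, 2 * mean_demand)
        served = min(on_hand, demand)
        lost += demand - served       # unmet demand is lost forever
        total_demand += demand
        on_hand -= served
    return lost / max(total_demand, 1)
```

Note that the policy never inspects `on_hand` or `pipeline` when ordering; that state-independence is exactly what makes it trivial to implement even when the lead time (and hence the state space) is enormous.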
In this paper we consider the problem of scheduling different classes of customers on multiple distributed servers to minimize an objective function based on per-class mean response times. This problem arises in a wide range of distributed systems, networks, and applications. Within the context of our model, we observe that the optimal sequencing strategy at each of the servers is a simple static priority policy. Using this observation, we argue that the globally optimal scheduling problem reduces to finding an optimal routing matrix under this sequencing policy. We formulate the latter problem as a nonlinear programming problem and show that any interior local minimum is a global minimum, which significantly simplifies the solution of the optimization problem. In the case of Poisson arrivals, we provide an optimal scheduling strategy that also tends to minimize a function of the per-class response time variances. Applying our analysis to various static instances of the general problem allows us to rederive many known results and yields simple approximation algorithms whose guarantees match the best known bounds.
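A static priority policy of the kind described above is easy to simulate: whenever a server becomes free, it serves the waiting customer of the highest-priority class, with the priority order fixed in advance. The single-server discrete-event sketch below is a hedged illustration of that sequencing rule only; the exponential distributions, the class-index priority order, and all parameter names are assumptions for the example, not the paper's model or its optimal routing.

```python
import random

def simulate_static_priority(arrival_rates, service_rates, n_jobs, seed=0):
    """Toy single-server simulation of non-preemptive static priority
    sequencing: when the server frees up, it serves the waiting job of
    the highest-priority class (lower class index = higher priority).
    Poisson arrivals, exponential service; returns per-class mean
    response times. Illustrative sketch only."""
    rng = random.Random(seed)
    # Pre-generate n_jobs arrival times per class (Poisson processes).
    events = []
    for c, lam in enumerate(arrival_rates):
        t = 0.0
        for _ in range(n_jobs):
            t += rng.expovariate(lam)
            events.append((t, c))
    events.sort()
    queues = [[] for _ in arrival_rates]   # FIFO arrival times per class
    responses = [[] for _ in arrival_rates]
    now, i = 0.0, 0
    while i < len(events) or any(queues):
        # Admit every arrival that has occurred by the current time.
        while i < len(events) and events[i][0] <= now:
            _, c = events[i]
            queues[c].append(events[i][0])
            i += 1
        waiting = [c for c, q in enumerate(queues) if q]
        if not waiting:
            now = events[i][0]             # idle: jump to next arrival
            continue
        c = min(waiting)                   # static priority: lowest index wins
        arrived = queues[c].pop(0)
        now += rng.expovariate(service_rates[c])   # serve to completion
        responses[c].append(now - arrived)
    return [sum(r) / len(r) for r in responses]
```

Under such a fixed sequencing rule, the only remaining control knob in a multi-server system is where each class's traffic is routed, which is what motivates reducing the global problem to an optimization over the routing matrix.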