The cost of reconciling consistency and state management with high availability is highly magnified by the unprecedented scale and robustness requirements of today's Internet applications. We propose two strategies for improving overall availability using simple mechanisms that scale over large applications whose output behavior tolerates graceful degradation. We characterize this degradation in terms of harvest and yield, and map it directly onto engineering mechanisms that enhance availability by improving fault isolation, and in some cases also simplify programming. By collecting examples of related techniques in the literature and illustrating the surprising range of applications that can benefit from these approaches, we hope to motivate a broader research program in this area.
Motivation, Hypothesis, RelevanceIncreasingly, infrastructure services comprise not only routing, but also application-level resources such as search engines [15], adaptation proxies [8], and Web caches [20].These applications must confront the same 24 7 operational expectations and exponentially-growing user loads as the routing infrastructure, and consequently are absorbing comparable amounts of hardware and software. The current trend of harnessing commodity-PC clusters for scalability and availability [9] is reflected in the largest web server installations. These sites use tens to hundreds of PC's to deliver 100M or more read-mostly page views per day, primarily using simple replication or relatively small data sets to increase throughput.The scale of these applications is bringing the wellknown tradeoff between consistency and availability [4] into very sharp relief. In this paper we propose two general directions for future work in building large-scale robust systems. Our approaches tolerate partial failures by emphasizing simple composition mechanisms that promote fault containment, and by translating possible partial failure modes into engineering mechanisms that provide smoothlydegrading functionality rather than lack of availability of the service as a whole. The approaches were developed in the context of cluster computing, where it is well accepted [22] that one of the major challenges is the nontrivial software engineering required to automate partial-failure handling in order to keep system management tractable.