The large amount of energy consumed by Internet services represents significant and fast-growing financial and environmental costs. Increasingly, services are exploring dynamic methods to minimize energy costs while respecting their service-level agreements (SLAs). Furthermore, it will soon be important for these services to manage their usage of "brown energy" (produced via carbon-intensive means) relative to renewable or "green" energy. This paper introduces a general, optimization-based framework for enabling multi-data-center services to manage their brown energy consumption and leverage green energy, while respecting their SLAs and minimizing energy costs. Based on the framework, we propose a policy for request distribution across the data centers. Our policy can be used to abide by caps on brown energy consumption, such as those that might arise from Kyoto-style carbon limits, from corporate pledges on carbon-neutrality, or from limits imposed on services to encourage brown energy conservation. We evaluate our framework and policy extensively through simulations and real experiments. Our results show how our policy allows a service to trade off consumption and cost. For example, using our policy, the service can reduce brown energy consumption by 24% for only a 10% increase in cost, while still abiding by SLAs.
Cloud service providers operate multiple geographically distributed data centers. These data centers consume huge amounts of energy, which translate into high operating costs. Interestingly, the geographical distribution of the data centers provides many opportunities for cost savings. For example, the electricity prices and outside temperatures may differ widely across the data centers. This diversity suggests that intelligently placing load may lead to large cost savings. However, aggressively directing load to the cheapest data center may render its cooling infrastructure unable to adjust in time to prevent server overheating. In this paper, we study the impact of load placement policies on cooling and maximum data center temperatures. Based on this study, we propose dynamic load distribution policies that consider all electricity-related costs as well as transient cooling effects. Our evaluation studies the ability of different cooling strategies to handle load spikes, compares the behaviors of our dynamic cost-aware policies to cost-unaware and static policies, and explores the effects of many parameter settings. Among other interesting results, we demonstrate that (1) our policies can provide large cost savings, (2) load migration enables savings in many scenarios, and (3) all electricity-related costs must be considered at the same time for higher and consistent cost savings.
Recent research has found that operators frequently misconfigure Internet services, causing various availability and performance problems. In this paper, we propose a software infrastructure that eliminates several types of misconfiguration by automating the generation of configuration files in Internet services, even as the services evolve. The infrastructure comprises a custom scripting language, configuration file templates, communicating runtime monitors, and heuristic algorithms to detect dependencies between configuration parameters and select ideal configurations. To demonstrate our infrastructure experimentally, we apply it to a realistic online auction service. Our results show that the infrastructure can simplify operation significantly while eliminating 58% of the misconfigurations found in a previous study of the same service. Furthermore, our results show that the infrastructure can efficiently determine the configuration parameters that lead to high performance as the service evolves through a hardware upgrade and the scheduled maintenance of a few nodes.
We demonstrate a framework for improving the availability of cluster based Internet services. Our approach models Internet services as a collection of interconnected components, each possessing well defined interfaces and failure semantics. Such a decomposition allows designers to engineer high availability based on an understanding of the interconnections and isolated fault behavior of each component, as opposed to ad-hoc methods. In this work, we focus on using the entire commodity workstation as a component because it possesses natural, fault-isolated interfaces. We define a failure event as a reboot because not only is a workstation unavailable during a reboot, but also because reboots are symptomatic of a larger class of failures, such as configuration and operator errors. Our observations of 3 distinct clusters show that the time between reboots is best modeled by a Weibull distribution with shape parameters of less than 1, implying that a workstation becomes more reliable the longer it has been operating. Leveraging this observed property, we design an allocation strategy which withholds recently rebooted workstations from active service, validating their stability before allowing them to return to service. We show via simulation that this policy leads to a 70-30 rule-of-thumb: For a constant utilization, approximately 70% of the workstation failures can be masked from end clients with 30% extra capacity added to the cluster, provided reboots are not strongly correlated. We also found our technique is most sensitive to the burstiness of reboots as opposed to absolute lengths of workstation uptimes.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.