The increasing popularity of Cloud computing as an attractive alternative to classic information processing systems has increased the importance of its correct and continuous operation even in the presence of faulty components. In this paper, we introduce an innovative, system-level, modular perspective on creating and managing fault tolerance in Clouds. We propose a comprehensive high-level approach to shading the implementation details of the fault tolerance techniques to application developers and users by means of a dedicated service layer. In particular, the service layer allows the user to specify and apply the desired level of fault tolerance, and does not require knowledge about the fault tolerance techniques that are available in the envisioned Cloud and their implementations.
A number of techniques have been proposed to provide runtime performance guarantees while minimizing power consumption. One drawback of existing approaches is that they work only on a fixed set of components (or actuators) that must be specified at design time. If new components become available, these management systems must be redesigned and reimplemented. In this paper, we propose PTRADE, a novel performance management framework that is general with respect to the components it manages. PTRADE can be deployed to work on a new system with different components without redesign and reimplementation. PTRADE's generality is demonstrated through the management of performance goals for a variety of benchmarks on two different Linux/x86 systems and a simulated 128-core system, each with different components governing power and performance tradeoffs. Our experimental results show that PTRADE provides generality while meeting performance goals with low error and close to optimal power consumption.
This paper presents the adoption of the Triple Modular Redundancy coupled with the Partial Dynamic Reconfiguration of Field Programmable Gate Arrays to mitigate the effects of Soft Errors in such class of device platforms. We propose an exploration of the design space with respect to several parameters (e.g., area and recovery time) in order to select the most convenient way to apply this technique to the device under consideration. The application to a case study is presented and used to exemplify the proposed approach.
Autonomic computing systems are capable of adapting their behavior and resources thousands of times a second to automatically decide the best way to accomplish a given goal despite changing environmental conditions and demands. Different decision mechanisms are considered in the literature, but in the vast majority of the cases a single technique is applied to a given instance of the problem. This article proposes a comparison of some state of the art approaches for decision making, applied to a self-optimizing autonomic system that allocates resources to a software application. A variety of decision mechanisms, from heuristics to control-theory and machine learning, are investigated. The results obtained with these solutions are compared by means of case studies using standard benchmarks. Our results indicate that the most suitable decision mechanism can vary depending on the specific test case but adaptive and model predictive control systems tend to produce good performance and may work best in a priori unknown situations. A. 2012. Comparison of decision-making strategies for self-optimization in autonomic computing systems.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.