The growing complexity of mass storage systems at major data centers is putting pressure on system administrators to keep performance at optimal levels. As storage requirements grow, so do the number of routine tasks the administrator must perform and the time those tasks take to execute. The solution proposed to ease this burden is the Mass Storage System Administrator Autonomic Assistant (MSSAAA). The MSSAAA is a collection of agents that perform some of the more common tasks while the administrators handle higher-level issues. Following the principles of autonomic computing, the MSSAAA is governed by a centralized set of policies that the administrator reviews on a regular basis and can adjust as necessary. The goal is to develop an autonomic assistant that substantially reduces the time it takes to address specific problems in the system. Using tools such as IBM's Generic Log Adapter, Resource Model Builder, and Autonomic Management Engine, the MSSAAA has been able to (i) quickly determine when tape errors occur and correct them, (ii) monitor the network file system mounts for poor performance and report it, and (iii) correct network file system handle problems through continuous monitoring. Preliminary savings analyses show that the assistant saves the system administrator at least 185 hours per year and over six thousand dollars in related costs. The results show how efficiently and effectively the MSSAAA handled its assigned tasks, and how it has eased the daily burden of storage system administrators.
DEDICATION
This thesis is dedicated to my wife Katie for her patience, understanding and constant support during the pursuit of my Master's degree.
ACKNOWLEDGEMENTS
There are many other groups of people who have provided feedback, insight and support at various levels.
Large-scale distributed systems are playing an increasing role in computational research, production operations, information processing, and application hosting. The continuous management of such systems is a critical consideration for reliability, availability, and security. As the number of commodity components within these systems continues to grow, it becomes increasingly difficult to track the multitude of parameters required to ensure optimal performance, especially in systems that have been built through incremental expansion rather than as an initial purchase of identical nodes. In this paper, we discuss the use of statistical inference, specifically Markov Logic Networks, in a distributed multi-agent system to provide the most effective means of managing these parameters. We showcase an architecture that provides services to manage a system's configuration throughout its life-cycle and that can resolve differences after identifying potential misconfigurations using conflict discovery and resolution modules.