This thesis deals with the problem of managing failures on distributed systems, specially on computer networks and high performance computing clusters. Through it, I expose and analyze the importance of the problem and how its current research landscape, while extensive, is fragmented, isolated and takes a too narrow approach. Specially, there is a gap of knowledge between academic and industrial problems and the need for a human expert and all of the problems that this entails have been overlooked. Based on this situation, I take two real datasets, a public one, detailing errors occurred on a supercomputer at Los Alamos, USA, and the Los resultados muestran que mis propuestas son capaces de conseguir soluciones exitosas con una interacción humana mínima, además de satisfacer los requerimientos y limitaciones técnicas. iv I can't start this section in any other way than recognizing the massive debt I owe to Juan Carlos Dueñas, my thesis advisor, boss and friend for the last five years. My life has changed in more ways that I could reflect here and I've learned and grown along this years. And I owe it all to you. A deep, sincere, thanks. And I also could not continue without thanking my workmates (and friends
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.