This thesis deals with the problem of managing failures in distributed systems, especially in computer networks and high-performance computing clusters. Throughout it, I expose and analyze the importance of the problem and show how its current research landscape, while extensive, is fragmented, isolated, and takes too narrow an approach. In particular, there is a knowledge gap between academic and industrial problems, and the need for a human expert, with all the problems this entails, has been overlooked. Based on this situation, I take two real datasets, a public one, detailing errors that occurred on a supercomputer at Los Alamos, USA, and the

The results show that my proposals are able to achieve successful solutions with minimal human interaction, while also satisfying the technical requirements and constraints.

I can't start this section in any other way than by recognizing the massive debt I owe to Juan Carlos Dueñas, my thesis advisor, boss and friend for the last five years. My life has changed in more ways than I could reflect here, and I've learned and grown over these years. And I owe it all to you. A deep, sincere thanks. And I also could not continue without thanking my workmates (and friends