Hiep Nguyen scite author profile

Abstract-Distributed applications running inside cloud systems are prone to performance anomalies due to various reasons such as resource contentions, software bugs, and hardware failures. One big challenge for diagnosing an abnormal distributed application is to pinpoint the faulty components. In this paper, we present a black-box online fault localization system called FChain that can pinpoint faulty components immediately after a performance anomaly is detected. FChain first discovers the onset time of abnormal behaviors at different components by distinguishing the abnormal change point from many change points caused by normal workload fluctuations. Faulty components are then pinpointed based on the abnormal change propagation patterns and inter-component dependency relationships. FChain performs runtime validation to further filter out false alarms. We have implemented FChain on top of the Xen platform and tested it using several benchmark applications (RUBiS, Hadoop, and IBM System S). Our experimental results show that FChain can quickly pinpoint the faulty components with high accuracy within a few seconds. FChain can achieve up to 90% higher precision and 20% higher recall than existing schemes. FChain is nonintrusive and light-weight, which imposes less than 1% overhead to the cloud system.

show abstract

PerfCompass: Online Performance Anomaly Fault Localization and Inference in Infrastructure-as-a-Service Clouds

Dean

Nguyen

Wang

et al. 2016

IEEE Trans. Parallel Distrib. Syst.

View full text Add to dashboard Cite

Infrastructure-as-a-service clouds are becoming widely adopted. However, resource sharing and multi-tenancy have made performance anomalies a top concern for users. Timely debugging those anomalies is paramount for minimizing the performance penalty for users. Unfortunately, this debugging often takes a long time due to the inherent complexity and sharing nature of cloud infrastructures. When an application experiences a performance anomaly, it is important to distinguish between faults with a global impact and faults with a local impact as the diagnosis and recovery steps for faults with a global impact or local impact are quite different. In this paper, we present PerfCompass, an online performance anomaly fault debugging tool that can quantify whether a production-run performance anomaly has a global impact or local impact. PerfCompass can use this information to suggest the root cause as either an external fault (e.g., environment-based) or an internal fault (e.g., software bugs). Furthermore, PerfCompass can identify top affected system calls to provide useful diagnostic hints for detailed performance debugging. PerfCompass does not require source code or runtime application instrumentation, which makes it practical for production systems. We have tested PerfCompass by running five common open source systems (e.g., Apache, MySQL, Tomcat, Hadoop, Cassandra) inside a virtualized cloud testbed. Our experiments use a range of common infrastructure sharing issues and real software bugs. The results show that PerfCompass accurately classifies 23 out of the 24 tested cases without calibration and achieves 100% accuracy with calibration. PerfCompass provides useful diagnosis hints within several minutes and imposes negligible runtime overhead to the production system during normal execution time.

show abstract

Ubl

Dean

Nguyen

2012

121

View full text Add to dashboard Cite

Exploring tradeoffs in pleiotropy and redundancy using evolutionary computing

Berryman

Khoo

Nguyen

et al. 2004

View full text Add to dashboard Cite

Evolutionary computation algorithms are increasingly being used to solve optimization problems as they have many advantages over traditional optimization algorithms. In this paper we use evolutionary computation to study the trade-off between pleiotropy and redundancy in a client-server based network. Pleiotropy is a term used to describe components that perform multiple tasks, while redundancy refers to multiple components performing one same task. Pleiotropy reduces cost but lacks robustness, while redundancy increases network reliability but is more costly, as together, pleiotropy and redundancy build flexibility and robustness into systems. Therefore it is desirable to have a network that contains a balance between pleiotropy and redundancy. We explore how factors such as link failure probability, repair rates, and the size of the network influence the design choices that we explore using genetic algorithms.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Hiep Nguyen

PREPARE: Predictive Performance Anomaly Prevention for Virtualized Cloud Systems

FChain: Toward Black-Box Online Fault Localization for Cloud Systems

PerfCompass: Online Performance Anomaly Fault Localization and Inference in Infrastructure-as-a-Service Clouds

Ubl

Exploring tradeoffs in pleiotropy and redundancy using evolutionary computing

Contact Info

Product

Resources

About