Software architecture is undergoing a transition from monolithic architectures to microservices to achieve resilience, agility, and scalability in software development. However, with microservices it is difficult to diagnose performance issues due to technology heterogeneity, the large number of microservices, and frequent updates to both software features and infrastructure. This paper presents MicroRCA, a system to locate root causes of performance issues in microservices. MicroRCA infers root causes in real time by correlating application performance symptoms with corresponding system resource utilization, without any application instrumentation. The root cause localization is based on an attributed graph that models anomaly propagation across services and machines. Our experimental evaluation, in which common anomalies are injected into a microservice benchmark running in a Kubernetes cluster, shows that MicroRCA locates root causes well, with 89% precision and 97% mean average precision, outperforming several state-of-the-art methods.
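To make the idea of ranking candidates on an anomaly propagation graph concrete, the following is a minimal sketch, not the MicroRCA implementation. The abstract does not name a ranking algorithm; a personalized random walk (in the style of personalized PageRank) is assumed here, and the function name, service names, edge weights, and anomaly scores are all hypothetical. Only services are modeled, although the abstract's graph also covers machines.

```python
def rank_root_causes(edges, anomaly_score, damping=0.85, iterations=100):
    """Rank nodes with a personalized random walk (assumed technique).

    edges: dict mapping node -> {neighbor: edge weight}
    anomaly_score: dict mapping node -> non-negative anomaly score
    """
    nodes = sorted(set(edges) | {dst for out in edges.values() for dst in out})
    total = sum(anomaly_score.get(n, 0.0) for n in nodes)
    # Restart distribution: the walk restarts more often at anomalous nodes.
    if total > 0:
        restart = {n: anomaly_score.get(n, 0.0) / total for n in nodes}
    else:
        restart = {n: 1.0 / len(nodes) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for src in nodes:
            out = edges.get(src, {})
            weight_sum = sum(out.values())
            for dst in nodes:
                if weight_sum > 0:
                    share = out.get(dst, 0.0) / weight_sum
                else:
                    # Nodes without outgoing edges hand their mass to the restart distribution.
                    share = restart[dst]
                nxt[dst] += damping * rank[src] * share
        rank = nxt
    return sorted(rank.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy input: a frontend calling two services, one clearly anomalous.
call_graph = {"frontend": {"cart": 0.9, "catalog": 0.1}}
scores = {"frontend": 0.2, "cart": 0.9, "catalog": 0.1}
ranking = rank_root_causes(call_graph, scores)
print(ranking[0][0])
```

In this toy input the most anomalous service also receives most of the walk's inflow, so it ends up at the top of the ranking. In a real deployment the edge weights and scores would come from correlating response-time symptoms with resource metrics.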
Despite the potential offered by the combination of multitenancy and virtualization, resource utilization in today's data centers is still low. We identify three key characteristics of cloud services and infrastructure-as-a-service management practices: burstiness in service workloads, fluctuations in virtual machine resource usage over time, and virtual machines being limited to pre-defined sizes only. Based on these characteristics, we propose scheduling and admission control algorithms that incorporate resource overbooking to improve utilization. A combination of modeling, monitoring, and prediction techniques is used to avoid exceeding the total infrastructure capacity. A performance evaluation using a mixture of workload traces demonstrates the potential for significant improvements in resource utilization while still avoiding exceeding the total capacity.
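The core of overbooking-aware admission control can be illustrated with a small sketch. This is an assumption-laden simplification rather than the authors' algorithm: the `VM` fields, the `admit` function, the single CPU dimension, and the 10% safety margin are all hypothetical. The point is that the decision compares predicted usage, not requested size, against capacity.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    requested_cpu: float       # cores reserved by the tenant
    predicted_peak_cpu: float  # cores the VM is predicted to use at peak


def admit(host_capacity, running, candidate, safety_margin=0.1):
    """Admit the candidate if predicted peak load stays under usable capacity."""
    usable = host_capacity * (1.0 - safety_margin)
    predicted_load = sum(vm.predicted_peak_cpu for vm in running)
    return predicted_load + candidate.predicted_peak_cpu <= usable


# Hypothetical host with 16 cores; 24 cores are already requested (overbooked),
# but only 9 cores are predicted to be in use at peak.
running = [VM(f"vm{i}", requested_cpu=8, predicted_peak_cpu=3) for i in range(3)]
light = VM("light", requested_cpu=8, predicted_peak_cpu=4)
heavy = VM("heavy", requested_cpu=8, predicted_peak_cpu=6)
print(admit(16, running, light), admit(16, running, heavy))
```

A request-based policy would reject both candidates here, since requested capacity already exceeds the host. The prediction-based check accepts the light VM and rejects the heavy one, which is where the utilization gain comes from.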
Elasticity is a key characteristic of cloud computing that increases flexibility for cloud consumers, allowing them to adapt the amount of physical resources associated with their services over time on an on-demand basis. However, elasticity creates problems for cloud providers, as it may lead to poor resource utilization, especially in combination with other factors such as user overestimations and pre-defined VM sizes. Admission control mechanisms are thus needed to increase the number of services accepted, raising utilization without affecting service performance. This work focuses on implementing an autonomic risk-aware overbooking architecture capable of increasing the resource utilization of cloud data centers by accepting more virtual machines than the available physical resources nominally support. Fuzzy logic functions are used to estimate the risk associated with each overbooking decision. By using a distributed PID controller approach, the system is capable of self-adapting over time, changing the acceptable level of risk depending on the current status of the cloud data center. The suggested approach is extensively evaluated using a combination of simulations and experiments executing real cloud applications with available real-life workloads. Our results show a 50% increase in both resource utilization and allocated capacity, with acceptable performance degradation and more stable resource utilization over time.
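As a rough illustration of how a PID controller could adapt the acceptable level of risk, consider the sketch below. It is a single, non-distributed controller under stated assumptions: the class name, the gains, the base threshold, and the choice of utilization as the controlled variable are hypothetical and are not taken from the paper. The fuzzy risk estimate is not modeled; an overbooking decision would simply be accepted when its estimated risk falls below the threshold returned here.

```python
class RiskThresholdPID:
    """Discrete PID loop mapping a utilization error to a risk threshold in [0, 1]."""

    def __init__(self, target_utilization, kp=0.5, ki=0.1, kd=0.05, base_threshold=0.5):
        self.target = target_utilization
        self.kp, self.ki, self.kd = kp, ki, kd  # hypothetical gains
        self.base = base_threshold
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measured_utilization, dt=1.0):
        error = self.target - measured_utilization
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Clamp so the threshold remains a valid risk level.
        return min(1.0, max(0.0, self.base + output))


# Under-utilized data center: the controller tolerates more risk.
relaxed = RiskThresholdPID(target_utilization=0.8).update(0.5)
# Over-utilized data center: the controller tolerates less risk.
strict = RiskThresholdPID(target_utilization=0.8).update(0.95)
print(relaxed > 0.5, strict < 0.5)
```

The clamp keeps the threshold bounded even when the integral term grows during long periods of under- or over-utilization; a production controller would typically also limit the integral itself.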