2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/ipdps.2018.00019
Performance Isolation of Data-Intensive Scale-out Applications in a Multi-tenant Cloud

Cited by 17 publications (13 citation statements) | References 21 publications
“…In this context, a job is composed of multiple smaller tasks (defined as the smallest unit of computation observable by the resource manager) [82]. Such jobs and their constituent tasks are scheduled onto different machines in parallel to accelerate job completion, and are often divided into phases forming a Directed Acyclic Graph (DAG) [83]. Application frameworks (such as MapReduce) attempt to sub-divide jobs so that the tasks of each phase complete within approximately the same timeframe [84].…”
Section: Straggler Definition and Impact (mentioning)
confidence: 99%
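The decomposition described above can be sketched in a few lines. This is an illustrative model only (the names `Task`, `Phase`, and `split_job` are assumptions, not from the cited frameworks): a job is divided into chained phases, and within each phase the work is split into tasks of near-equal size so they finish in roughly the same timeframe.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    phase: str
    index: int
    work_units: int  # smallest unit of computation visible to the resource manager

@dataclass
class Phase:
    name: str
    tasks: list = field(default_factory=list)
    depends_on: list = field(default_factory=list)  # DAG edges to earlier phases

def split_job(total_work: int, phases: list, tasks_per_phase: int) -> list:
    """Divide a job into chained phases, giving every task in a phase
    near-equal work so tasks complete within the same timeframe."""
    dag = []
    for name in phases:
        base, rem = divmod(total_work, tasks_per_phase)
        tasks = [Task(name, i, base + (1 if i < rem else 0))
                 for i in range(tasks_per_phase)]
        # each phase depends on the previous one, e.g. map -> reduce
        dag.append(Phase(name, tasks, depends_on=dag[-1:]))
    return dag

dag = split_job(total_work=10, phases=["map", "reduce"], tasks_per_phase=4)
```

Real frameworks balance by input-split size rather than abstract work units, but the chained-phase DAG structure is the same.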
“…In this context, a job is composed of multiple smaller tasks (defined as the smallest unit of computation observable by the resource manager) [82]. Such jobs and subsequent tasks are scheduled onto different machines in a parallelized manner to accelerate job completion and are often divided into phases creating a Direct Acyclic Graph (DAG) [83]. Application frameworks (such as MapReduce) attempt to sub-divide jobs so that tasks will approximately complete within the same timeframe for each phase [84].…”
Section: Straggler Definition and Impactmentioning
confidence: 99%
“…The central resource manager (RM) is application-agnostic and completely unaware of the runtime QoS requirements of interactive and latency-sensitive applications; the RM is responsible only for resource allocation among jobs and leaves all application-specific logic to application managers. Existing solutions for workload co-location either aim to reduce performance interference through resource partitioning and isolation [10] [11] [12] or leverage QoS-aware scheduling that places jobs/applications so as to minimize interference [13] [14] [15]. However, they are optimized for monolithic applications and have only indirect effects on DLRAs, which exhibit more sophisticated component dependencies and performance variations (e.g., latency) owing to the vast number of requests flowing across entire system components.…”
Section: Renyu Yang Is the Corresponding Author (mentioning)
confidence: 99%
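The QoS-aware placement idea mentioned above can be sketched as choosing, among candidate nodes, the one with the smallest predicted interference against already co-located jobs. Everything here is a hypothetical illustration under assumed names (`Node`, `pairwise_interference`), not the method of any cited system:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    jobs: list = field(default_factory=list)  # jobs already running on this node

def pairwise_interference(job_a: str, job_b: str) -> float:
    # Stand-in interference model: assume two I/O-heavy jobs contend strongly.
    heavy = {"sort", "shuffle", "backup"}
    return 1.0 if job_a in heavy and job_b in heavy else 0.1

def place(job: str, nodes: list) -> Node:
    """Return the node where the summed predicted interference between
    `job` and that node's resident jobs is smallest."""
    return min(nodes, key=lambda n: sum(pairwise_interference(job, j)
                                        for j in n.jobs))

nodes = [Node("n1", ["sort"]), Node("n2", ["web"])]
chosen = place("shuffle", nodes)  # avoids co-locating two I/O-heavy jobs
```

In practice the interference predictor is learned from profiling or runtime metrics; the placement rule itself stays this simple.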
“…For tasks without locality specifications, TOPOSCH is more likely to place a task onto a node with a lower risk level, reducing the impact of co-location on latency. To achieve this, we adopt a random-number-based approach that gives low-risk nodes a higher probability of being chosen (Lines 11-14). Furthermore, so that the DLRA's QoS dominates, the RM has the privilege to preempt and evict running batch tasks to remedy detected QoS degradation.…”
Section: B Task Delay Scheduling Under Resource Reservation (mentioning)
confidence: 99%
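A random-number-based selection that favors low-risk nodes can be realized with weighted sampling. This is a minimal sketch under an assumed weighting (weight = 1 − risk), not TOPOSCH's actual routine from Lines 11-14 of the cited algorithm:

```python
import random

def pick_low_risk_node(risk_by_node: dict, rng=random) -> str:
    """Randomly pick a node, giving lower-risk nodes higher probability.

    risk_by_node maps node name -> risk level in [0, 1].
    Weight 1 - risk means a node at risk 0.1 is chosen ~9x more often
    than a node at risk 0.9.
    """
    nodes = list(risk_by_node)
    weights = [1.0 - risk_by_node[n] for n in nodes]  # low risk -> high weight
    return rng.choices(nodes, weights=weights, k=1)[0]
```

The randomness matters: a deterministic "always pick the lowest-risk node" rule would herd all incoming tasks onto one node, while weighted sampling spreads load yet still biases away from risky co-locations.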
“…Other elements of related work also rely on internal metrics such as memory and CPU usage [7,10,27,28]. PerfCloud [30] uses system-level metrics to proactively detect performance interference between tenant workloads, showing that such approaches avoid costly workload profiling and prediction mechanisms without interfering with application code.…”
Section: Related Work (mentioning)
confidence: 99%
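Detection from system-level metrics alone, as described above, can be sketched as a simple baseline-deviation check. This is in the spirit of such approaches rather than PerfCloud's actual mechanism (the metric names and the 25% threshold are assumptions); the point it illustrates is that no application instrumentation or workload profiling is required:

```python
def interference_detected(baseline: dict, current: dict,
                          threshold: float = 0.25) -> bool:
    """Flag interference when any system-level metric deviates from the
    tenant's solo-run baseline by more than `threshold` (relative)."""
    for metric, base in baseline.items():
        if base and abs(current[metric] - base) / base > threshold:
            return True
    return False

baseline = {"cpu": 0.40, "mem": 0.50}   # measured while running alone
```

A production detector would smooth readings over a window to avoid flagging transient spikes, but the baseline-comparison core is the same.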