Lukas Rupprecht scite author profile

Lukas Rupprecht

5 Publications

87 Citation Statements Received

53 Citation Statements Given

How they've been cited

141

How they cite others

121

Affiliations

IBM Research - Almaden, Imperial College London, IBM (United States)

Publications


CloudScope: Diagnosing and Managing Performance Interference in Multi-tenant Clouds

Chen, Rupprecht, Osman, et al. 2015


Virtual machine consolidation is attractive in cloud computing platforms for several reasons, including reduced infrastructure costs, lower energy consumption, and ease of management. However, the interference between co-resident workloads caused by virtualization can violate the service level objectives (SLOs) that the cloud platform guarantees. Existing solutions to minimize interference between virtual machines (VMs) are mostly based on comprehensive micro-benchmarks or online training, which makes them computationally intensive. In this paper, we present CloudScope, a system for diagnosing interference for multi-tenant cloud systems in a lightweight way. CloudScope employs a discrete-time Markov chain model for the online prediction of performance interference of co-resident VMs. It uses the results to optimally (re)assign VMs to physical machines and to optimize the hypervisor configuration, e.g., the CPU share it can use, for different workloads. We have implemented CloudScope on top of the Xen hypervisor and conducted experiments using a set of CPU-, disk-, and network-intensive workloads and a real system (MapReduce). Our results show that CloudScope interference prediction achieves an average error of 9%. The interference-aware scheduler improves VM performance by up to 10% compared to the default scheduler. In addition, the hypervisor reconfiguration can improve network throughput by up to 30%.
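For readers unfamiliar with the modeling technique named in this abstract, the following is a minimal, generic sketch of a discrete-time Markov chain: it estimates transition probabilities from an observed state sequence and propagates a state distribution forward one step. It is not CloudScope's model; the state names ("low", "high"), the sample sequence, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def estimate_transitions(states):
    """Estimate transition probabilities from an observed state sequence."""
    counts = defaultdict(Counter)
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    return {
        s: {t: n / sum(row.values()) for t, n in row.items()}
        for s, row in counts.items()
    }


def predict(distribution, transitions, steps=1):
    """Propagate a distribution over states forward by `steps` time steps."""
    for _ in range(steps):
        nxt = defaultdict(float)
        for s, p in distribution.items():
            for t, q in transitions.get(s, {}).items():
                nxt[t] += p * q
        distribution = dict(nxt)
    return distribution


# Hypothetical sequence of observed load levels (illustrative data only).
observed = ["low", "low", "high", "low", "high", "high", "low", "low"]
P = estimate_transitions(observed)

# Starting from a known "low" state, forecast the distribution one step ahead.
forecast = predict({"low": 1.0}, P, steps=1)
print(forecast)
```

Each row of the estimated matrix sums to one, so the forecast remains a valid probability distribution after any number of steps.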


Large-Scale Analysis of the Docker Hub Dataset

Zhao, Tarasov, Albahar, et al. 2019


Improving reproducibility of data science pipelines through transparent provenance capture

Rupprecht, Davis, Arnold, et al. 2020

Proc. VLDB Endow.


Data science has become prevalent in a large variety of domains. Inherent in its practice is an exploratory, probing, and fact-finding journey, which consists of the assembly, adaptation, and execution of complex data science pipelines. The trustworthiness of the results of such pipelines rests entirely on their ability to be reproduced with fidelity, which is difficult if pipelines are not documented or recorded minutely and consistently. This difficulty has led to a reproducibility crisis and presents a major obstacle to the safe adoption of the pipeline results in production environments. The crisis can be resolved if the provenance for each data science pipeline is captured transparently as pipelines are executed. However, due to the complexity of modern data science pipelines, transparently capturing sufficient provenance to allow for reproducibility is challenging. As a result, most existing systems require users to augment their code or use specific tools to capture provenance, which hinders productivity and results in a lack of adoption. In this paper, we present Ursprung, a transparent provenance collection system designed for data science environments. The Ursprung philosophy is to capture provenance and build lineage by integrating with the execution environment to automatically track static and runtime configuration parameters of data science pipelines. Rather than requiring data scientists to make changes to their code, Ursprung records basic provenance information from system-level sources and combines it with provenance from application-level sources (e.g., log files, stdout), which can be accessed and recorded through a domain-specific language. In our evaluation, we show that Ursprung is able to capture sufficient provenance for a variety of use cases and only adds an overhead of up to 4%.


Large-Scale Analysis of Docker Images and Performance Implications for Container Storage Systems

Zhao, Tarasov, Albahar, et al. 2021

IEEE Trans. Parallel Distrib. Syst.


Exploiting in-network processing for big data management

Rupprecht 2013


Data processing systems face the task of efficiently storing and processing data at petabyte scale, with the amount set to increase in the future. To meet such a requirement, highly scalable, shared-nothing systems, e.g. Google's BigTable [6] or Facebook's Cassandra [14], are built to partition data and process it in parallel on distributed nodes in a cluster. This allows the handling of data at scale but introduces new challenges due to the distribution of data. Running queries involves a high network overhead because data has to be exchanged between cluster nodes and hence, the network becomes a critical part of the system. To avoid the network bottleneck, it is essential for distributed data processing systems (DDPS) to be aware of the network rather than treating it as a black box. We propose in-network processing as a way of achieving network-awareness to decrease bandwidth usage by custom routing, redundancy elimination, and on-path data reduction. Thereby, we can increase the query throughput of a DDPS. The challenges of an in-network processing system range from design issues, such as performance and transparency, to the integration with query optimisation and deployment in data centres. We formulate these challenges as possible research directions and provide a prototype implementation. Our preliminary results suggest that we can significantly improve query throughput in a DDPS by performing partial data reduction within the network.
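The idea of on-path data reduction mentioned in this abstract can be illustrated with a small, generic sketch of partial aggregation: an intermediate node merges per-key partial sums from several senders, so fewer records travel onward. This is not the paper's prototype; the record layout, the sample data, and the function names are hypothetical and serve only to illustrate the general technique.

```python
from collections import Counter


def partial_aggregate(records):
    """Sum values per key for one sender's (key, value) records."""
    partial = Counter()
    for key, value in records:
        partial[key] += value
    return partial


def merge(partials):
    """Combine partial sums, as an intermediate node on the path might."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key
    return total


# Hypothetical records emitted by two senders (illustrative data only).
worker_a = [("x", 1), ("y", 2), ("x", 3)]
worker_b = [("x", 5), ("z", 1)]

combined = merge([partial_aggregate(worker_a), partial_aggregate(worker_b)])

records_without_reduction = len(worker_a) + len(worker_b)
records_with_reduction = len(combined)
print(combined, records_without_reduction, records_with_reduction)
```

Because addition is associative and commutative, merging partial sums along the path yields the same final result as aggregating all records at the destination, while forwarding one record per key instead of one per input record.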

