Virtual machine monitors (VMMs) have been hailed as the basis for an increasing number of reliable or trusted computing systems. The Xen VMM is a relatively small piece of software (a hypervisor) that runs at a lower level than a conventional operating system in order to provide isolation between virtual machines: its size is offered as an argument for its trustworthiness. However, the management of a Xen-based system requires a privileged, full-blown operating system to be included in the trusted computing base (TCB). In this paper, we introduce our work to disaggregate the management virtual machine in a Xen-based system. We begin by analysing the Xen architecture and explaining why the status quo results in a large TCB. We then describe our implementation, which moves the domain builder, the most important privileged component, into a minimal trusted compartment. We illustrate how this approach may be used to implement "trusted virtualisation" and improve the security of virtual TPM implementations. Finally, we evaluate our approach in terms of the reduction in TCB size, and by performing a security analysis of the disaggregated system.
Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments. This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations. We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability.
* Work done primarily at Google Brain.
TensorFlow is a powerful, programmable system for machine learning. This paper aims to provide the basics of a conceptual framework for understanding the behavior of TensorFlow models during training and inference: it describes an operational semantics, of the kind common in the literature on programming languages. More broadly, the paper suggests that a programming-language perspective is fruitful in designing and in explaining systems such as TensorFlow.
CCS Concepts: • Theory of computation → Operational semantics; • Computing methodologies → Neural networks; • Software and its engineering → Data flow architectures
Naiad is a distributed system for executing data parallel, cyclic dataflow programs. It offers the high throughput of batch processors, the low latency of stream processors, and the ability to perform iterative and incremental computations. Although existing systems offer some of these features, applications that require all three have relied on multiple platforms, at the expense of efficiency, maintainability, and simplicity. Naiad resolves the complexities of combining these features in one framework. A new computational model, timely dataflow, underlies Naiad and captures opportunities for parallelism across a wide class of algorithms. This model enriches dataflow computation with timestamps that represent logical points in the computation and provide the basis for an efficient, lightweight coordination mechanism. We show that many powerful high-level programming models can be built on Naiad's low-level primitives, enabling such diverse tasks as streaming data analysis, iterative machine learning, and interactive graph mining. Naiad outperforms specialized systems in their target application domains, and its unique features enable the development of new high-performance applications.
We describe the timely dataflow model for distributed computation and its implementation in the Naiad system. The model supports stateful iterative and incremental computations. It enables both low-latency stream processing and high-throughput batch processing, using a new approach to coordination that combines asynchronous and fine-grained synchronous execution. We describe two of the programming frameworks built on Naiad: GraphLINQ for parallel graph processing, and differential dataflow for nested iterative and incremental computations. We show that a general-purpose system can achieve performance that matches, and sometimes exceeds, that of specialized systems.
Tracking the progress of computations can be both important and delicate in distributed systems. In a recent distributed algorithm for this purpose, each processor maintains a delayed view of the pending work, which is represented in terms of points in virtual time. This paper presents a formal specification of that algorithm in the temporal logic TLA, and describes a mechanically verified correctness proof of its main properties.
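As a rough illustration of the idea of a per-processor view of pending work indexed by virtual time, the following Python sketch may help. It is not the algorithm specified in the paper: the `ProgressView` class, its methods, and the use of plain integers as totally ordered virtual times are all assumptions made for this example.

```python
from collections import Counter


class ProgressView:
    """Hypothetical sketch: one processor's (possibly delayed) view of
    pending work, keyed by virtual time. Integer times are an assumption."""

    def __init__(self):
        # virtual time -> net count of pending work items seen so far
        self.pending = Counter()

    def apply(self, updates):
        """Apply a batch of (time, delta) updates received from other processors."""
        for time, delta in updates:
            self.pending[time] += delta
            if self.pending[time] == 0:
                del self.pending[time]  # drop times with no net pending work

    def is_complete(self, time):
        """True if this view records no outstanding count at or before `time`."""
        return all(t > time for t in self.pending)


view = ProgressView()
view.apply([(0, +2), (1, +1)])  # two items pending at time 0, one at time 1
view.apply([(0, -1)])           # one item at time 0 has finished
done_before = view.is_complete(0)
view.apply([(0, -1)])           # the other item at time 0 has finished
done_after = view.is_complete(0)
```

In this sketch the view treats any nonzero count at or before a time as blocking, which is a conservative choice for the example rather than a claim about the verified algorithm.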
At the heart of a secure software system is a small, trustworthy component, called the Trusted Computing Base (TCB). However, developers persist in building monolithic systems that force their users to trust the entire system. We posit that this is due to the lack of a straightforward mechanism for partitioning (or disaggregating) systems into trusted and untrusted components. We propose to use the dynamic library as the unit of disaggregation, because it is a familiar abstraction, which is commonly used in mainstream software development. In this paper, we present our early ideas on the disaggregated library approach, which can be applied to existing applications that run on commodity operating systems. We first make the case for a new approach to disaggregation, and then describe how we are implementing it. We also draw comparisons with the wide range of related work in this area.