This paper describes a general algorithm and a system for load balancing sparse fluid simulations. Automatically distributing sparse fluid simulations efficiently is challenging because the computational load varies across the simulation domain and time. A key challenge with load balancing is that optimal decision making requires knowing the fluid distribution across partitions for future time steps, but computing this state for an arbitrary simulation requires running the simulation itself. The key insight of this paper is that it is possible to predict future load by running a speculative low resolution simulation in parallel. We mathematically formulate the problem of load balancing over multiple time steps and present a polynomial time algorithm to compute an approximate solution to it. Our experimental results show that distributing and speculatively load balancing sparse FLIP simulations over 8 nodes speeds them up by 5.3× to 7.9×, and that speculative load balancing generates assignments that perform within 20% of optimal.
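The idea of balancing load over several future time steps can be illustrated with a small sketch. This is not the paper's algorithm: it is a simple greedy heuristic, assuming that a hypothetical low-resolution run has already produced `predicted_load`, a per-partition load estimate for each step in a short window. The names `assign_partitions` and `total_time` are invented for illustration. The cost being minimized is the sum, over time steps, of the load on the busiest node at that step.

```python
from typing import Dict, List, Tuple


def total_time(node_load: List[List[float]]) -> float:
    """Sum over time steps of the busiest node's load at that step."""
    return sum(max(step) for step in zip(*node_load))


def assign_partitions(
    predicted_load: Dict[str, List[float]], num_nodes: int
) -> Tuple[Dict[str, int], float]:
    """Greedily place partitions on nodes using predicted per-step load.

    predicted_load maps a partition name to its estimated load at each
    future step (assumed to come from a speculative low-resolution run).
    """
    steps = len(next(iter(predicted_load.values())))
    node_load = [[0.0] * steps for _ in range(num_nodes)]
    assignment: Dict[str, int] = {}

    # Place the heaviest partitions first (by total load over the window).
    order = sorted(predicted_load, key=lambda p: -sum(predicted_load[p]))
    for part in order:
        load = predicted_load[part]

        def cost_if_placed(n: int) -> float:
            # Cost of the whole schedule if this partition went to node n.
            trial = [row[:] for row in node_load]
            trial[n] = [a + b for a, b in zip(trial[n], load)]
            return total_time(trial)

        best = min(range(num_nodes), key=cost_if_placed)
        node_load[best] = [a + b for a, b in zip(node_load[best], load)]
        assignment[part] = best

    return assignment, total_time(node_load)


# Partitions "a" and "b" are busy at different steps, so they can share a node.
predicted = {"a": [4, 0], "b": [0, 4], "c": [2, 2], "d": [2, 2]}
assignment, cost = assign_partitions(predicted, num_nodes=2)
```

In this toy input, a balancer that looked only at the first step would see "a" as heavy and "b" as idle; looking across the window lets the two share a node because their loads never overlap.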
Existing cloud computing control planes do not scale to more than a few hundred cores, while frameworks without control planes scale but take seconds to reschedule a job. We propose an asynchronous control plane for cloud computing systems, in which a central controller can dynamically reschedule jobs but worker nodes never block on communication with the controller. By decoupling control plane traffic from program control flow in this way, an asynchronous control plane can scale to run millions of computations per second while being able to reschedule computations within milliseconds. We show that an asynchronous control plane can match the scalability and performance of TensorFlow and MPI-based programs while rescheduling individual tasks in milliseconds. Scheduling an individual task takes 1μs, such that a 1,152 core cluster can schedule over 120 million tasks/second and this scales linearly with the number of cores. The ability to schedule huge numbers of tasks allows jobs to be divided into very large numbers of tiny tasks, whose improved load balancing can speed up computations 2.1-2.3×.
CCS CONCEPTS • Computer systems organization → Cloud computing; • Computing methodologies → Parallel algorithms;
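The property that workers never block on the controller can be illustrated with a toy, single-process sketch. It is not the system described above: the `Worker` class, the `("add", task)` and `("remove", task)` message format, and the task names are all assumptions made for illustration. Each worker drains its control inbox with a non-blocking read between tasks, so an empty inbox never stalls execution.

```python
import queue


class Worker:
    """Toy worker that runs local tasks and polls for control messages."""

    def __init__(self, tasks):
        self.inbox = queue.Queue()  # control messages from the controller
        self.pending = list(tasks)  # tasks scheduled on this worker
        self.done = []              # tasks this worker has executed

    def apply_control_messages(self):
        # Drain the inbox without ever waiting on the controller.
        while True:
            try:
                op, task = self.inbox.get_nowait()
            except queue.Empty:
                return
            if op == "remove" and task in self.pending:
                self.pending.remove(task)
            elif op == "add":
                self.pending.append(task)

    def run_one(self):
        """Apply any queued control changes, then run one task if available."""
        self.apply_control_messages()
        if not self.pending:
            return False
        self.done.append(self.pending.pop(0))
        return True


w0 = Worker(["t0", "t1", "t2"])
w1 = Worker([])
w0.run_one()  # w0 executes t0 with an empty inbox and does not wait

# The controller moves t2 from w0 to w1 while both workers keep running.
w0.inbox.put(("remove", "t2"))
w1.inbox.put(("add", "t2"))

while w0.run_one() or w1.run_one():
    pass
```

A real system would run workers concurrently and would need to handle a task that is already executing when its removal arrives; the sketch leaves both out.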
Distributing a simulation across many machines can drastically speed up computations and increase detail. The computing cloud provides tremendous computing resources, but weak service guarantees force programs to manage significant system complexity: nodes, networks, and storage occasionally perform poorly or fail.
We describe Nimbus, a system that automatically distributes grid-based and hybrid simulations across cloud computing nodes. The main simulation loop is sequential code and launches distributed computations across many cores. The simulation on each core runs as if it is stand-alone: Nimbus automatically stitches these simulations into a single, larger one. To do this efficiently, Nimbus introduces a four-layer data model that translates between the contiguous, geometric objects used by simulation libraries and the replicated, fine-grain objects managed by its underlying cloud computing runtime.
Using PhysBAM particle-level set fluid simulations, we demonstrate that Nimbus can run higher detail simulations faster, distribute simulations on up to 512 cores, and run enormous simulations (1024³ cells). Nimbus automatically manages these distributed simulations, balancing load across nodes and recovering from failures. Implementations of PhysBAM water and smoke simulations as well as an open source heat-diffusion simulation show that Nimbus is general and can support complex simulations.
Nimbus can be downloaded from https://nimbus.stanford.edu.