“…With empty tasks [28], the resulting upper bound on task scheduling throughput fails to represent useful work within a realistic application. With non-empty tasks, since the efficiency of the overall application is typically not reported [3,6], TPS is not a measurement of runtime-limited performance. Large tasks may be used to hide any amount of runtime overhead, while small tasks may result in a drop in total application throughput even as TPS increases.…”
Section: METG
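To make the critique concrete, consider a toy single-worker model (an illustration of the argument above, not code from the paper): with a fixed per-task runtime overhead o and useful work g per task, TPS is 1/(g + o) while efficiency is g/(g + o), so shrinking tasks drives TPS up toward its ceiling of 1/o even as useful throughput collapses.

# Toy model in Python; `o` (per-task overhead) is an assumed value.
def tasks_per_second(g, o):
    return 1.0 / (g + o)

def efficiency(g, o):
    return g / (g + o)  # fraction of time spent on useful work

o = 10e-6  # assume 10 us of scheduling overhead per task
for g in (1e-3, 1e-4, 1e-6):  # task granularities in seconds
    print(f"g={g*1e6:7.1f} us  TPS={tasks_per_second(g, o):9.0f}  "
          f"efficiency={efficiency(g, o):6.1%}")
# TPS rises monotonically as tasks shrink (990 -> 9091 -> 90909 per second)
# while efficiency falls (99.0% -> 90.9% -> 9.1%), which is why TPS alone
# says nothing about runtime-limited performance.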
“…Limit studies of task scheduling throughput in various runtime systems often make additional assumptions. A popular assumption is the use of trivially parallel tasks [3,6], which as shown in Section 5.5 underestimates (often substantially) the cost of scheduling a task and can also impact scalability.…”
Section: Related Work
“…Intuitively, for a given workload, METG(50%) is the smallest task granularity that maintains at least 50% efficiency, meaning that the application achieves at least 50% of the highest performance (in FLOP/s, B/s, or other application-specific measure) achieved on a given machine. The efficiency bound in METG is a key innovation over previous approaches, such as tasks per second (TPS), that fail to consider the amount of useful work performed (if tasks are non-empty [3,6]) or to perform useful work at all (if tasks are empty [28]).…”
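A minimal sketch of how METG(50%) could be measured, assuming a run_benchmark(g) callable that reports achieved application performance (FLOP/s, B/s, or similar) at task granularity g; the names here are illustrative, not Task Bench's actual API:

def metg(granularities, run_benchmark, threshold=0.5):
    # granularities: candidate task granularities in seconds
    # run_benchmark(g): achieved performance at granularity g
    perf = {g: run_benchmark(g) for g in granularities}
    peak = max(perf.values())  # best performance observed on this machine
    efficient = [g for g in granularities if perf[g] >= threshold * peak]
    return min(efficient)  # smallest granularity keeping >= 50% efficiency

# Toy usage with the analytic overhead model from above (o = 10 us):
o = 10e-6
print(metg([1e-3, 1e-4, 1e-5, 1e-6], lambda g: g / (g + o)))  # -> 1e-05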
We present Task Bench, a parameterized benchmark designed to explore the performance of parallel and distributed programming systems under a variety of application scenarios. Task Bench lowers the barrier to benchmarking multiple programming systems by making the implementation for a given system orthogonal to the benchmarks themselves: every benchmark constructed with Task Bench runs on every Task Bench implementation. Furthermore, Task Bench's parameterization enables a wide variety of benchmark scenarios that distill the key characteristics of larger applications. We conduct a comprehensive study with implementations of Task Bench in 15 programming systems on up to 256 Haswell nodes of the Cori supercomputer. We introduce a novel metric, minimum effective task granularity (METG), to study the baseline runtime overhead of each system. We show that when running at scale, 100 µs is the smallest granularity that even the most efficient systems can reliably support with current technologies. We also study each system's scalability and its ability to hide communication and to mitigate load imbalance.
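The orthogonality the abstract describes can be sketched as follows: a benchmark is just a parameterized task graph, and each programming system supplies one driver that can execute any such graph, so supporting N systems and M benchmarks takes N + M implementations rather than N x M. The TaskGraph class below is a hypothetical illustration, not Task Bench's actual interface:

from dataclasses import dataclass

@dataclass
class TaskGraph:
    steps: int        # number of timesteps in the graph
    width: int        # tasks per timestep
    pattern: str      # dependence pattern, e.g. "trivial" or "stencil"
    kernel_us: float  # useful work per task, in microseconds

    def dependencies(self, index):
        # Which tasks in the previous timestep a task at `index` depends on.
        if self.pattern == "trivial":
            return [index]
        if self.pattern == "stencil":
            return [i for i in (index - 1, index, index + 1)
                    if 0 <= i < self.width]
        raise ValueError(f"unknown pattern: {self.pattern}")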
“…Most cluster computing frameworks, such as Spark [64], CIEL [40], and Dryad [28] implement a centralized scheduler, which can provide locality but at latencies in the tens of ms. Distributed schedulers such as work stealing [12], Sparrow [45] and Canary [47] can achieve high scale, but they either don't consider data locality [12], or assume tasks belong to independent jobs [45], or assume the computation graph is known [47].…”
“…Canary [47] achieves impressive performance by having each scheduler instance handle a portion of the task graph, but does not handle dynamic computation graphs.…”
The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements, both in terms of performance and flexibility. In this paper, we consider these requirements and present Ray, a distributed system designed to address them. Ray implements a unified interface that can express both task-parallel and actor-based computations, supported by a single dynamic execution engine. To meet the performance requirements, Ray employs a distributed scheduler and a distributed and fault-tolerant store to manage the system's control state. In our experiments, we demonstrate scaling beyond 1.8 million tasks per second and better performance than existing specialized systems for several challenging reinforcement learning applications.
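A minimal sketch of the unified interface described above, using Ray's public Python API (ray.init, @ray.remote, .remote(), ray.get); the function and class names are illustrative:

import ray

ray.init()

@ray.remote              # task-parallel: a stateless remote function
def square(x):
    return x * x

@ray.remote              # actor-based: a stateful remote class
class Counter:
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1
        return self.value

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1

Both remote function calls and actor method calls return futures that are resolved with ray.get, so the two models compose within a single dynamic execution engine.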