Lazy scheduling is a runtime scheduler for task-parallel codes that effectively coarsens parallelism on load conditions in order to significantly reduce its overheads compared to existing approaches, thus enabling the efficient execution of more fine-grained tasks. Unlike other adaptive dynamic schedulers, lazy scheduling does not maintain any additional state to infer system load and does not make irrevocable serialization decisions. These two features allow it to scale well and to provide excellent load balancing in practice but at a much lower overhead cost compared to work stealing, the golden standard of dynamic schedulers. We evaluate three variants of lazy scheduling on a set of benchmarks on three different platforms and find it to substantially outperform popular work stealing implementations on fine-grained codes. Furthermore, we show that the vast performance gap between manually coarsened and fully parallel code is greatly reduced by lazy scheduling, and that, with minimal static coarsening, lazy scheduling delivers performance very close to that of fully tuned code.The tedious manual coarsening required by the best existing work stealing schedulers and its damaging effect on performance portability have kept novice and general-purpose programmers from parallelizing their codes. Lazy scheduling offers the foundation for a declarative parallel programming methodology that should attract those programmers by minimizing the need for manual coarsening and by greatly enhancing the performance portability of parallel code. ACM Reference Format:Alexandros Tzannes, George C. Caragea, Uzi Vishkin, and Rajeev Barua. 2014. Lazy scheduling: A runtime adaptive scheduler for declarative parallelism.
The Explicit Multi-Threading (XMT) is a generalpurpose many-core computing platform, with the vision of a 1000-core chip that is easy to program but does not compromise on performance. This paper presents a publicly available toolchain for XMT, complete with a highly configurable cycle-accurate simulator and an optimizing compiler.The XMT toolchain has matured and has been validated to a point where its description merits publication. In particular, research and experimentation enabled by the toolchain played a central role in supporting the ease-of-programming and performance aspects of the XMT architecture. The compiler and the simulator are also important milestones for an efficient programmer's workflow from PRAM algorithms to programs that run on the shared memory XMT hardware. This workflow is a key component in accomplishing the dual goal of ease-ofprogramming and performance.The applicability of our toolchain extends beyond specific XMT choices. It can be used to explore the much greater design space of shared memory many-cores by system researchers or by programmers. As the toolchain can practically run on any computer, it provides a supportive environment for teaching parallel algorithmic thinking with a programming component. Unobstructed by techniques such as decomposition-first and programming for locality, this environment may be useful in deferring the teaching of these techniques, when desired, to more advanced or platform-specific courses.
Abstract-Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi-and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies however have been unable to keep up with this trend, with modern designs allowing only 4-16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g. GPUs) where power and area per-core budget favor lighter cores with less resources.Support for hardware and software prefetch increase MLP pressure since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of the number of simultaneous prefetches supported, and optimized for the same. We show that in situations where not enough resources are available to issue prefetch instructions for all references in a loop, it is more beneficial to decrease the prefetch distance and prefetch for as many references as possible, rather than use a fixed prefetched distance and skip prefetching for some references, as in current approaches.We implemented our algorithm in a GCC-derived compiler and evaluated its performance using an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% and the state-of-the art GCC implementation by up to 34.79%. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism, and show improvements of up to 24.61%.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.