Abstract. In computational science, it is necessary to make efficient use of multicore architectures for dealing with complex real-life application problems. However, with increased hardware complexity, the cost in man hours of writing and rewriting software to adapt to evolving computer systems is becoming prohibitive. Task-based parallel programming models aim to allow the application programmers to focus on the algorithms and applications, while the performance is handled by a runtime system that schedules the tasks onto nodes, cores, and accelerators. In this paper we describe a task parallel programming model where dependencies are represented through data versioning. Our model allows expressing the program control flow without artificial dependencies, has low complexity for resolving dependencies, and enables scheduling decisions to be made locally. We implement this as a freely available C++ header-only template library, and show experimental results indicating that our implementation both scales and performs well in comparison to similar runtime systems.

Key words. task parallel, data version, dependency, shared memory

AMS subject classifications. 65Y05, 65Y10

DOI. 10.1137/140989716

1. Background and related work. Modern processors for laptop, desktop, and server computers have several computational cores. In order to write efficient software for such processors, software needs to be parallel. Since writing parallel software is known to be difficult and error-prone, it is desirable that the parallelization-specific parts are separated from the rest of the software.

In this paper, we present a runtime system that handles the details of the parallelization for the user. The runtime system manages dependencies between computations and the mapping of computations to hardware resources for the programmer.
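The data-versioning idea can be illustrated with a small sketch. This is a hypothetical mock-up, not the library's actual interface: the names (`Handle`, `Runtime`, `submit`) are ours. Each data handle carries two counters, and a task records, per accessed handle, the version it requires, so readiness can be decided locally without a global dependency graph:

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Illustrative sketch of dependency tracking through data versioning.
// Each handle counts how many writes have *finished* (version) and how
// many writes have been *submitted* (next_write).
struct Handle {
    std::size_t version = 0;     // writes completed so far
    std::size_t next_write = 0;  // writes submitted so far
};

struct Access {
    Handle* handle;
    std::size_t required;  // handle->version must reach this value
    bool is_write;
};

struct Task {
    std::vector<Access> accesses;
    std::function<void()> body;
    bool ready() const {
        for (const Access& a : accesses)
            if (a.handle->version < a.required) return false;
        return true;
    }
};

class Runtime {
    std::vector<Task> pending;
public:
    void submit(std::function<void()> body,
                std::vector<Handle*> reads,
                std::vector<Handle*> writes) {
        Task t;
        t.body = std::move(body);
        for (Handle* h : reads)   // wait for all previously submitted writes
            t.accesses.push_back({h, h->next_write, false});
        for (Handle* h : writes)  // additionally claim the next version slot
            t.accesses.push_back({h, h->next_write++, true});
        pending.push_back(std::move(t));
    }
    // Serial stand-in for the scheduler: repeatedly run the earliest ready
    // task. Running ready tasks in submission order also yields
    // write-after-read ordering here; a parallel runtime must additionally
    // track outstanding readers.
    void run() {
        while (!pending.empty()) {
            for (std::size_t i = 0; i < pending.size(); ++i) {
                if (!pending[i].ready()) continue;
                pending[i].body();
                for (const Access& a : pending[i].accesses)
                    if (a.is_write) ++a.handle->version;  // publish result
                pending.erase(pending.begin() +
                              static_cast<std::ptrdiff_t>(i));
                break;
            }
        }
    }
};
```

Note how no dependency edges are stored: each task compares a handle's current version against the version it requires, which is what allows scheduling decisions to be made locally.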
Since we specifically target scientific computing applications where performance is key, the provided abstractions are carefully designed not to sacrifice performance. To be practically useful and easy to incorporate into existing solutions, the runtime system is provided as a header-only C++ library and can run either on top of OpenMP or use POSIX threads (Pthreads) for thread management. By moving the dependency management and scheduling into a library that exposes a convenient and expressive interface for specifying dependencies, the development of parallel software becomes easier, faster, and less error-prone, and is likely to result in more efficient software.

Dependencies and synchronization. The most common way to write shared-memory parallel software is to parallelize for-loops using OpenMP [7]. While this works well for many applications, it enforces a fork-join structure where the software is divided into parallel sections that end with a barrier at which all threads are synchronized. These barriers scale poorly as the number of cores increases and can reduce performance substantially. To achieve higher performance, synchronization between threads needs to be more fine-grained and reduced to a minimum.
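The fork-join structure described above can be made concrete with a small sketch (the function name `two_phase` is ours, for illustration). Each parallel region ends in an implicit barrier, so every thread waits before phase 2 starts, even if its share of phase 2 depends only on its own share of phase 1; without `-fopenmp` the pragmas are ignored and the code runs serially, so the sketch stays portable:

```cpp
#include <cstddef>
#include <vector>

// Two dependent loops, each parallelized OpenMP-style, separated by the
// implicit barrier at the end of the first parallel region.
std::vector<double> two_phase(std::vector<double> a) {
    const long n = static_cast<long>(a.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        a[i] = a[i] * 2.0;        // phase 1
    // Implicit barrier here: all threads synchronize before phase 2,
    // which is exactly the coarse-grained synchronization the text
    // identifies as a scalability bottleneck.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        a[i] = a[i] + 1.0;        // phase 2
    return a;
}
```

A dependency-aware runtime can instead let each element-wise chain proceed as soon as its own inputs are ready, removing the global barrier.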
Dependency-aware task-based parallel programming models have proven to be successful for developing efficient application software for multicore-based computer architectures. The programming model is amenable to programmers, thereby supporting productivity, whereas hardware performance is achieved through a runtime system that dynamically schedules tasks onto cores in such a way that all dependencies are respected. However, even if the scheduling is completely successful with respect to load balancing, the scaling with the number of cores may be suboptimal due to resource contention. Here we consider the problem of scheduling tasks not only with respect to their interdependencies but also with respect to their usage of resources, such as memory and bandwidth. At the software level, this is achieved by user annotations of the task resource consumption. In the runtime system, the annotations are translated into scheduling constraints. Experimental results for different hardware, demonstrating performance gains both for model examples and real applications, are presented. Furthermore, we provide a set of tools to detect resource sensitivity and predict the performance improvements that can be achieved by resource-aware scheduling. These tools are solely based on parallel execution traces and require no instrumentation or modification of the application code.
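A minimal sketch of what such a resource constraint amounts to, assuming memory bandwidth is the annotated resource (`pack_rounds` is a hypothetical helper, not the runtime's API): each task carries a user-annotated demand, and the scheduler never lets the summed demand of concurrently running tasks exceed the machine's capacity. Greedy first-fit packing into concurrent "rounds" illustrates the constraint:

```cpp
#include <vector>

// Pack tasks (given by their annotated resource demand, e.g. fraction of
// memory bandwidth) into groups that may run concurrently without the
// summed demand exceeding `capacity`. Returns the per-round demand; the
// number of rounds is a proxy for the serialization the constraint adds.
std::vector<double> pack_rounds(const std::vector<double>& demand,
                                double capacity) {
    std::vector<double> rounds;
    for (double d : demand) {
        bool placed = false;
        for (double& r : rounds) {
            if (r + d <= capacity) {  // first round with spare capacity
                r += d;
                placed = true;
                break;
            }
        }
        if (!placed) rounds.push_back(d);  // open a new round
    }
    return rounds;
}
```

In a real runtime this check happens dynamically at task launch rather than as an up-front packing, but the trade-off is the same: slightly less concurrency in exchange for avoiding contention on the shared resource.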
A scalable RBF-FD method for atmospheric flow. Journal of Computational Physics

Abstract. Radial basis function-generated finite difference (RBF-FD) methods have recently been proposed as a promising approach for global-scale geophysical simulations, and have been shown to outperform established pseudo-spectral and discontinuous Galerkin methods for shallow water test problems. In order to be competitive for very large scale simulations, the RBF-FD methods need to be efficiently implemented for modern multicore-based computer architectures. This is a challenging task, because the main computational operations are unstructured sparse matrix-vector multiplications, which in general scale poorly on multicore computers due to bandwidth limitations. However, with the task parallel implementation described here we achieve 60-100% of theoretical speedup within a shared memory node, and 80-100% of linear speedup across nodes. We present results for global shallow water benchmark problems with a 30 km resolution.
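The dominant kernel mentioned above can be sketched in compressed sparse row (CSR) format; the row-block interface below is illustrative of how such a kernel is split into tasks (one task per row block), not the paper's actual implementation:

```cpp
#include <vector>

// Sparse matrix in compressed sparse row (CSR) storage.
struct CSR {
    std::vector<int> rowptr;     // size nrows+1; row i occupies [rowptr[i], rowptr[i+1])
    std::vector<int> col;        // column index per stored entry
    std::vector<double> val;     // value per stored entry
};

// y[row_begin..row_end) = A * x for one block of rows. In a task-parallel
// setting each such block is one task; the indirect access x[col[k]] is
// what makes the kernel bandwidth-bound and unstructured.
void spmv_block(const CSR& A, const std::vector<double>& x,
                std::vector<double>& y, int row_begin, int row_end) {
    for (int i = row_begin; i < row_end; ++i) {
        double s = 0.0;
        for (int k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            s += A.val[k] * x[A.col[k]];
        y[i] = s;
    }
}
```

Because each block writes a disjoint slice of `y` and only reads `x`, the blocks are independent tasks; the scaling limits come from shared memory bandwidth rather than from dependencies.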
Current high-performance computer systems used for scientific computing typically combine shared memory computational nodes in a distributed memory environment. Extracting high performance from these complex systems requires tailored approaches. Task based parallel programming has been successful both in simplifying the programming and in exploiting the available hardware parallelism for shared memory systems. In this paper we focus on how to extend task parallel programming to distributed memory systems. We use a hierarchical decomposition of tasks and data in order to accommodate the different levels of hardware. We test the proposed programming model on two different applications, a Cholesky factorization and a solver for the Shallow Water Equations. We also compare the performance of our implementation with that of other frameworks for distributed task parallel programming, and show that it is competitive.

arXiv:1801.03578v1 [cs.DC] 10 Jan 2018

Regarding the computational work performed by one node, the 1×9 process grid has the smallest variance between nodes, and therefore also the lowest maximum work size. The 9×1 process grid leads to a smaller maximum work size than the 3×3 process grid if B is large enough, but suffers from significant load imbalance in the case B = 18. In all cases, the work becomes more evenly distributed if the number of level 1 tasks B is larger. The statistics for communication and computation point in different directions, but when comparing with actual run times, we have found that the communication size is the most informative measure. Having a large total communication size is likely to be detrimental to performance, as the risk of tasks being left waiting for remote data increases, as does the risk of congestion of messages. A square process grid is the factor that has the largest impact.
Regarding the block sizes, having a large B improves the load balance, but increases the amount of communication as well as the number of messages (another indicator that is not shown in the graphics).
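The grid comparison above can be made concrete with a small counting sketch. We assume a 2D block-cyclic distribution of the lower-triangular B×B block grid of a Cholesky factor and count blocks per node; `max_blocks_lower` is a hypothetical helper, and counting blocks is only a proxy — the work-size statistics in the text weight each task by its cost, which is why the 1×9 grid comes out best there even though raw block counts are symmetric in p and q:

```cpp
#include <algorithm>
#include <vector>

// Count how many lower-triangular blocks of a B x B block grid a 2D
// block-cyclic distribution assigns to each node of a p x q process grid,
// and return the maximum over nodes (the bound on per-node block count).
int max_blocks_lower(int B, int p, int q) {
    std::vector<int> count(p * q, 0);
    for (int i = 0; i < B; ++i)
        for (int j = 0; j <= i; ++j)          // lower triangle only
            ++count[(i % p) * q + (j % q)];   // owner of block (i, j)
    return *std::max_element(count.begin(), count.end());
}
```

For B = 18, the square 3×3 grid caps every node at 21 blocks, while the tall 9×1 grid gives its most loaded node 27 — the load imbalance at B = 18 mentioned above.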