Abstract-Automatic data distribution is a key feature to obtain efficient implementations from abstract and portable parallel codes. We present a highly efficient and extensible runtime library that integrates techniques for automatic data partition and mapping. It uses a novel approach to define an abstract interface and a plug-in system to encapsulate different types of regular and irregular techniques, helping to generate codes which are independent of the exact mapping functions selected. Currently, it supports hierarchical tiling of arrays with dense and stride domains, that allows the implementation of both data and task parallelism using a SPMD model. It automatically computes appropriate domain partitions for a selected virtual topology, mapping them to available processors with static or dynamic load-balancing techniques. Our library also allows the construction of reusable communication patterns that efficiently exploit MPI communication capabilities. The use of our library greatly reduces the complexity of data distribution and communication, hiding the details of the underlying architecture. The library can be used as an abstract layer for building generic tiling operations as well. Our experimental results show that the use of this library allows to achieve similar performance as carefully-implemented manual versions for several, well-known parallel kernels and benchmarks in distributed and multicore systems, and substantially reduces programming effort.
The Single-Source Shortest Path (SSSP) problem arises in many different fields. In this paper we present a GPUbased version of the Crauser et al. SSSP algorithm. Our work significantly speeds up the computation of the SSSP, not only with respect to the CPU-based version, but also to other state-ofthe-art GPU implementation based on Dijkstra, due to Martín et al. Both GPU implementations have been evaluated using the last Nvidia architecture (Kepler). Our experimental results show that the new GPU-Crauser algorithm leads to speed-ups from 13× to 220× with respect to the CPU version and a performance gain of up to 17% with respect the GPU-Martín algorithm.
OpenMP directives are the de-facto standard for shared-memory parallel programming. However, OpenMP does not guarantee the correctness of the parallel execution of a given loop if runtime data dependences arise. Consequently, many highlyparallel regions cannot be safely parallelized with OpenMP due to the possibility of a dependence violation. In this paper, we propose to augment OpenMP capabilities, by adding Thread-Level Speculation (TLS) support. Our contribution is threefold. First, we have defined a new speculative clause for variables inside parallel loops. This clause ensures that all accesses to these variables will be carried out according to sequential semantics. Second, we have created a new, software-based TLS runtime library to ensure correctness in the parallel execution of OpenMP loops that include speculative variables. Third, we have developed a new GCC plugin, which seamlessly translates our OpenMP speculative clause into calls to our TLS runtime engine. The result is the ATLaS C Compiler framework, which takes advantage of TLS techniques to expand OpenMP functionalities, and guarantees the sequential semantics of any parallelized loop.
The Single-Source Shortest Path (SSSP) problem arises in many different fields. In this paper, we present a GPU SSSP algorithm implementation. Our work significantly speeds up the computation of the SSSP, not only with respect to a CPU-based version, but also to other state-of-the-art GPU implementations based on Dijkstra. Both GPU implementations have been evaluated using the latest NVIDIA architectures. The graphs chosen as input sets vary in nature, size, and fan-out degree, in order to evaluate the behavior of the algorithms for different data classes. Additionally, we have enhanced our GPU algorithm implementation using two optimization techniques: The use of a proper choice of threadblock size; and the modification of the GPU L1 cache memory state of NVIDIA devices. These optimizations lead to performance improvements of up to 23% with respect to the non-optimized versions. In addition, we have made a platform comparison of several NVIDIA boards in order to distinguish which one is better for each class of graphs, depending on their features. Finally, we compare our results with an optimized sequential implementation of Dijkstra's algorithm included in the reference Boost library, obtaining an improvement ratio of up to 19× for some graph families, using less memory space.
Programming models of pure nested-parallelism are appealing due to their ease of programming and good analysis and debugging properties. Although their simple synchronization structure is appropriate to represent abstract parallel algorithms, it does not take into account many implementation issues. In this work we present TRASGO, a programming system based on high-level, nested-parallel specifications. We show how it allows to easily express complex combinations of data and task parallelism with a common scheme, hiding the layout and scheduling details. The approach allows the development of a modular compiler where automatic transformation techniques may exploit lower level and more complex synchronization structures, unlocking the limitations of pure nested-parallel programming. This article presents an overview of the features of TRASGO, and its architecture. We present some performance results using well-known parallel algorithms, and a roadmap of improvements and new features to be added to TRASGO.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.