Tiling has proven to be an effective mechanism to develop high performance implementations of algorithms. Tiling can be used to organize computations so that communication costs in parallel programs are reduced and locality in sequential codes or sequential components of parallel programs is enhanced.In this paper, a data type -Hierarchically Tiled Arrays or HTAs -that facilitates the direct manipulation of tiles is introduced. HTA operations are overloaded array operations. We argue that the implementation of HTAs in sequential OO languages transforms these languages into powerful tools for the development of high-performance parallel codes and codes with high degree of locality. To support this claim, we discuss our experiences with the implementation of HTAs for MATLAB and C++ and the rewriting of the NAS benchmarks and a few other programs into HTA-based parallel form.
SUMMARYStencil computations constitute the kernel of many scientific applications. Tiling is often used to improve the performance of stencil codes for data locality and parallelism. However, tiled stencil codes typically require shadow regions, whose management becomes a burden to programmers. In fact, it is often the case that the code required to manage these regions, and in particular their updates, is much longer than the computational kernel of the stencil. As a result, shadow regions usually impact programmers' productivity negatively. In this paper, we describe overlapped tiling, a construct that supports shadow regions in a convenient, flexible and efficient manner in the context of the hierarchically tiled array (HTA) data type. The HTA is a class designed to express algorithms with a high degree of parallelism and/or locality as naturally as possible in terms of tiles. We discuss the syntax and implementation of overlapped HTAs as well as our experience in rewriting parallel and sequential codes using them. The results have been satisfactory in terms of both productivity and performance. For example, overlapped HTAs reduced the number of communication statements in non-trivial codes by 78% on average while speeding them up. We also examine different implementation options and compare overlapped HTAs with previous approaches.
This paper demonstrates the first tera-scale performance of Intel R Xeon Phi TM coprocessors on 1D fft computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 tflops with 512 nodes, which is 1.5× than achievable on a same number of Intel R Xeon R nodes. It is a challenge to fully utilize the compute capability presented by many-core widevector processors for bandwidth-bound fft computation. We leverage a new algorithm, Segment-of-Interest fft, with low inter-node communication cost, and aggressively optimize data movements in node-local computations, exploiting caches. Our coordination of low communication algorithm and massively parallel architecture for scalable performance is not limited to running fft on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging hpc systems that are increasingly communication limited.
The importance of tiles or blocks in scientific computing cannot be overstated. Many algorithms, both iterative and recursive, can be expressed naturally if tiles are represented explicitly. From the point of view of performance, tiling, either as a code or a data layout transformation, is one of the most effective ways to exploit locality, which is a must to achieve good performance in current computers because of the significant difference in speed between processor and memory. Furthermore, tiles are also useful to express data distribution in parallel computations. However, despite the importance of tiles, most languages do not support them directly. This gives place to bloated programs populated with numerous subscript expressions which make the code difficult to read and coding mistakes more likely.This paper discusses Hierarchically Tiled Arrays (HTAs), a data type which facilitates the easy manipulation of tiles in objectoriented languages with emphasis on two new features, dynamic partitioning and overlapped tiling. These features facilitate the expression of locality and communication while maintaining the same performance of algorithms written using conventional languages.
In this paper, we show our initial experience with a class of objects, called Hierarchically Tiled Arrays (HTAs), that encapsulate parallelism. HTAs allow the construction of single-threaded parallel programs where a master process distributes tasks to be executed by a collection of servers holding the components (tiles) of the HTAs. The tiled and recursive nature of HTAs facilitates the adaptation of the programs that use them to varying machine configurations, and eases the mapping of data and tasks to parallel computers with a hierarchical organization. We have implemented HTAs as a MATLAB TM toolbox, overloading conventional operators and array functions such that HTA operations appear to the programmer as extensions of MATLAB TM . Our experiments show that the resulting environment is ideal for the prototyping of parallel algorithms and greatly improves the ease of development of parallel programs while providing reasonable performance.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.