Cache-coherent multiprocessors with distributed shared memory are becoming increasingly popular for parallel computing. However, obtaining high performance on these machines requires that an application execute with good data locality. In addition to making effective use of caches, it is often necessary to distribute data structures across the local memories of the processing nodes, thereby reducing the latency of cache misses. While processor caches can exploit temporal locality on both local and remote data, many applications, such as those without temporal reuse or with working sets larger than the cache, are unable to benefit from cache locality alone. To obtain high performance on such applications, it is often necessary to distribute the data structures in the program so that the cache misses of each processor are more likely to be satisfied from local rather than remote memory. We have designed a set of abstractions for performing data distribution in the context of explicitly parallel programs and implemented them within the SGI MIPSpro compiler system. Our system incorporates many unique features to enhance both programmability and performance. We address the former by providing a very simple programming model with extensive support for error detection. Regarding performance, we carefully design the user abstractions with the underlying compiler optimizations in mind, we incorporate several optimization techniques to generate efficient code for accessing distributed data, and we provide a tight integration of these techniques with other optimizations within the compiler. Our initial experience suggests that the directives are easy to use and can yield substantial performance gains, in some cases by as much as a factor of 3 over the same codes without distribution.

In this paper we describe a set of data distribution abstractions for CC-NUMA multiprocessors.
We have designed these abstractions as a set of directives that allow the programmer to manually control the distribution of array data structures in explicitly parallel programs. We provide a small set of abstractions that are easy to use, yet expressive enough for real applications. Our directives are integrated with existing mechanisms for exploiting loop-level parallelism. Furthermore, the directives are designed keeping in mind the compiler's ability to generate efficient code for accesses to distributed data. Taken together, our abstractions enable the programmer to exploit loop-level parallelism and exercise fine control over both data distribution and computation scheduling. We have implemented these directives in the SGI MIPSpro 7.1 commercial compiler system targeting the Origin-2000 multiprocessor.
On modern computers, the performance of programs is often limited by memory latency rather than by processor cycle time. To reduce the impact of memory latency, the restructuring compiler community has developed locality-enhancing program transformations, the most well-known of which is loop tiling. Tiling is restricted to perfectly nested loops, but many imperfectly nested loops can be transformed into perfectly nested loops that can then be tiled. Recently, we proposed an alternative approach to locality enhancement called data shackling. Data shackling reasons about data traversals rather than iteration space traversals, and can be applied directly to imperfectly nested loops. We have implemented shackling in the SGI MIPSpro compiler, which already has a sophisticated implementation of tiling. Our experiments on the SGI Octane workstation with dense numerical linear algebra programs show that shackled code obtains double the performance of tiled code for most of these programs, and obtains five times the performance of tiled code for some versions of Cholesky factorization. Data shackling has been integrated into the SGI MIPSpro compiler product line.
In this paper, we present the design and implementation of an inter-procedural loop fusion, array contraction, and rotation technique in a production compiler. We provide experimental results to show that this technique improves SPECfp2000 benchmarks by 12%. The technique employs a locality-conscious inter-procedural analysis to drive inlining decisions. It then uses regular section analysis and code motion techniques to enable loop fusion across procedure boundaries. We discuss the implementation of data promotion and array contraction techniques. We introduce an array rotation technique to eliminate the overhead of copying array sections.