Cache-coherent multiprocessors with distributed shared memory are becoming increasingly popular for parallel computing. However, obtaining high performance on these machines requires that an application execute with good data locality. In addition to making effective use of caches, it is often necessary to distribute data structures across the local memories of the processing nodes, thereby reducing the latency of cache misses. While processor caches can exploit temporal locality on both local and remote data, many applications, such as those without temporal reuse or with working sets larger than the cache, are unable to benefit from cache locality alone. To obtain high performance on such applications, it is often necessary to distribute the data structures in the program so that the cache misses of each processor are more likely to be satisfied from local rather than remote memory. We have designed a set of abstractions for performing data distribution in the context of explicitly parallel programs and implemented them within the SGI MIPSpro compiler system. Our system incorporates many unique features to enhance both programmability and performance. We address the former by providing a very simple programming model with extensive support for error detection. Regarding performance, we carefully design the user abstractions with the underlying compiler optimizations in mind, we incorporate several optimization techniques to generate efficient code for accessing distributed data, and we provide a tight integration of these techniques with other optimizations within the compiler. Our initial experience suggests that the directives are easy to use and can yield substantial performance gains, in some cases by as much as a factor of 3 over the same codes without distribution.

In this paper we describe a set of data distribution abstractions for CC-NUMA multiprocessors.
We have designed these abstractions as a set of directives that allow the programmer to manually control the distribution of array data structures in explicitly parallel programs. We provide a small set of abstractions that are easy to use, yet expressive enough for real applications. Our directives are integrated with existing mechanisms for exploiting loop-level parallelism. Furthermore, the directives are designed keeping in mind the compiler's ability to generate efficient code for accesses to distributed data. Taken together, our abstractions enable the programmer to exploit loop-level parallelism and exercise fine control over both data distribution and computation scheduling. We have implemented these directives in the SGI MIPSpro 7.1 commercial compiler system targeting the Origin-2000 multiprocessor.
On modern computers, the performance of programs is often limited by memory latency rather than by processor cycle time. To reduce the impact of memory latency, the restructuring compiler community has developed locality-enhancing program transformations, the most well-known of which is loop tiling. Tiling is restricted to perfectly nested loops, but many imperfectly nested loops can be transformed into perfectly nested loops that can then be tiled. Recently, we proposed an alternative approach to locality enhancement called data shackling. Data shackling reasons about data traversals rather than iteration space traversals, and can be applied directly to imperfectly nested loops. We have implemented shackling in the SGI MIPSpro compiler, which already has a sophisticated implementation of tiling. Our experiments on the SGI Octane workstation with dense numerical linear algebra programs show that shackled code obtains double the performance of tiled code for most of these programs, and obtains five times the performance of tiled code for some versions of Cholesky factorization. Data shackling has been integrated into the SGI MIPSpro compiler product line.
In this paper, we present the design and implementation of an inter-procedural loop fusion, array contraction, and rotation technique in a production compiler. We provide experimental results to show that this technique improves SPECfp2000 benchmarks by 12%. The technique employs a locality-conscious inter-procedural analysis to drive inlining decisions. It then uses regular section analysis and code motion techniques to enable loop fusion across procedure boundaries. We discuss the implementation of data promotion and array contraction techniques. We introduce an array rotation technique to eliminate the overhead of copying array sections.