This paper describes extensions to OpenMP that implement data placement features needed for NUMA architectures. OpenMP is a collection of compiler directives and library routines used to write portable parallel programs for shared-memory architectures. Writing efficient parallel programs for NUMA architectures, which have characteristics of both shared-memory and distributed-memory architectures, requires that a programmer control the placement of data in memory and the placement of computations that operate on that data. Optimal performance is obtained when computations occur on processors that have fast access to the data needed by those computations. OpenMP, which was designed for shared-memory architectures, does not by itself address these issues. The extensions to OpenMP Fortran presented here are taken mainly from High Performance Fortran. The paper describes some of the techniques that the Compaq Fortran compiler uses to generate efficient code based on these extensions. It also describes some additional compiler optimizations, and concludes with some preliminary results.
This paper evaluates performance characteristics of the HP GS1280 shared-memory multiprocessor system. The GS1280 system contains up to 64 Alpha 21364 CPUs connected via a torus-based interconnect. We describe architectural features of the GS1280 system. We compare and contrast the GS1280 with the previous-generation Alpha systems: AlphaServer GS320 and ES45/SC45. We further show quantitatively the performance effects of these features, using application results and profiling data based on the built-in performance counters. We find that the HP GS1280 often provides 2 to 3 times the performance of the AlphaServer GS320 at similar clock frequencies. We find that the key reasons for such performance gains are advances in memory, inter-processor, and I/O subsystem designs.
The characteristics of several commercial and technical workloads on the DEC 7000 AXP system are compared using built-in hardware monitors. The data analyzed include total instructions, cycles, multiple-issued instructions, stall components, cache misses, and instruction types. The data indicate that the two classes of workloads have vastly different characteristics and impose different requirements on the system design. Compared to VAX, Alpha AXP takes advantage of lower cycles per instruction and a shorter cycle time to achieve a significant performance advantage. The cache and memory interconnect subsystems are expected to play a crucial role in the performance of future systems. A simple model for evaluating the effects of various design tradeoffs, based on the data collected by using hardware monitors, is proposed.
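The relationship between instruction count, cycles per instruction (CPI), and cycle time mentioned in the abstract above can be illustrated with a minimal sketch. This is not the model proposed in the paper, whose details the abstract does not give; it is the standard decomposition of execution time, with CPI split into a base component plus per-instruction stall components. All function names, stall categories, and numbers below are hypothetical and chosen only for illustration.

```python
# Minimal sketch of a CPI-based performance model (illustrative only; this is
# NOT the model from the paper, whose details the abstract does not state).

def cpi_from_counters(cycles, instructions):
    """Measured CPI from two hardware-monitor totals."""
    return cycles / instructions


def total_cpi(base_cpi, stall_cpi):
    """CPI as a base component plus stall cycles per instruction.

    stall_cpi maps a (hypothetical) stall category to the average number
    of stall cycles it contributes per instruction.
    """
    return base_cpi + sum(stall_cpi.values())


def execution_time(instructions, cpi, cycle_time_ns):
    """Execution time in seconds: instructions * CPI * cycle time."""
    return instructions * cpi * cycle_time_ns * 1e-9


if __name__ == "__main__":
    # Hypothetical counter values, not measurements from any real system.
    stalls = {"dcache_miss": 0.3, "icache_miss": 0.1, "branch": 0.1}
    cpi = total_cpi(base_cpi=1.0, stall_cpi=stalls)
    print("CPI:", cpi)
    print("time (s):", execution_time(2_000_000, cpi, cycle_time_ns=5.0))
```

In a model of this shape, a design tradeoff is evaluated by changing one term, such as a single stall component or the cycle time, and recomputing the execution time.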