Large-scale parallel machines are incorporating increasingly sophisticated architectural support for user-level messaging and global memory access. We provide a systematic evaluation of a broad spectrum of current design alternatives based on our implementations of a global address language on the Thinking Machines CM-5, Intel Paragon, Meiko CS-2, Cray T3D, and Berkeley NOW. This evaluation includes a range of compilation strategies that make varying use of the network processor; each is optimized for the target architecture and the particular strategy. We analyze a family of interacting issues that determine the performance tradeoffs in each implementation, quantify the resulting latency, overhead, and bandwidth of the global access operations, and demonstrate the effects on application performance.
Introduction

In recent years several architectures have demonstrated practical scalability beyond a thousand microprocessors, including the nCUBE/2, Thinking Machines CM-5, Intel Paragon, Meiko CS-2, and Cray T3D. More recently, researchers have also demonstrated high-performance communication in Networks of Workstations (NOW) using scalable switched local area network technology [28, 6, 12]. While the dominant programming model at this scale is message passing, the primitives used are inherently expensive, due to buffering and scheduling overheads [29]. Consequently, these machines provide varying levels of architectural support for communication in a global address space via various forms of memory read and write.

We developed the Split-C language to allow experimentation with new communication hardware mechanisms by involving the compiler in the support for the global address operations. Global memory operations are statically typed, so the Split-C compiler can generate a short sequence of code for each potentially remote operation as required by
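To make the compiler's role concrete, the following fragment is a minimal sketch of the kind of global access Split-C supports. The `global` pointer qualifier, the split-phase assignment operator `:=`, and `sync()` come from the Split-C design; the identifiers themselves are illustrative, and the code is not ANSI C (it requires the Split-C compiler).

```
/* Illustrative Split-C fragment (not ANSI C). The `global` qualifier
 * marks a pointer that may refer to memory on a remote processor,
 * represented as a (processor, local address) pair. */
int *global gp;   /* potentially remote pointer */
int x, y;

x = *gp;          /* blocking read: the compiler emits a short remote-get
                     sequence and waits for the reply */

y := *gp;         /* split-phase read: issue the get and continue,
                     overlapping communication with local work */
/* ... independent computation here ... */
sync();           /* complete all outstanding split-phase operations;
                     y is valid only after this point */
```

Because the type of `gp` is known statically, the compiler can specialize the generated code sequence to whatever communication mechanism the target machine provides, rather than routing every access through a generic runtime call.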