The Stanford Dash multiprocessor

Lenoski, Daniel E.; Laudon, James; Gharachorloo, Kourosh; Weber, Wolf-Dietrich; Gupta, Anoop; Hennessy, John L.; Horowitz, Mark; Lam, Monica S.

doi:10.1109/2.121510

Cited by 765 publications

(337 citation statements)

References 13 publications

Mentioning: 327


“…In this section, we take a closer look at two specific DSM implementations, the hardware cache-coherent DSM examples include [46,44,23,38,47], and the software page-based DSM examples include [48,6,13,36,37]. We will focus on how the implementation of distributed shared memory and cache coherence differ on these two architectures.…”

Section: DSM Implementation (mentioning)

confidence: 99%

Multigrain shared memory

Yeung

Kubiatowicz

Agarwal

2000

ACM Trans. Comput. Syst.


Parallel workstations, each comprising a 10-100 processor shared memory machine, promise cost-effective general-purpose multiprocessing. This thesis explores the coupling of such small- to medium-scale shared memory multiprocessors through software over a local area network to synthesize larger shared memory systems. Multiprocessors built in this fashion are called Distributed Scalable Shared memory Multiprocessors (DSSMPs). The challenge of building DSSMPs lies in seamlessly extending hardware-supported shared memory of each parallel workstation to span a cluster of parallel workstations using software only. Such a shared memory system is called Multigrain Shared Memory because it naturally supports two grains of sharing: fine-grain cache-line sharing within each parallel workstation, and coarse-grain page sharing across parallel workstations. Applications that can leverage the efficient fine-grain support for shared memory provided by each parallel workstation have the potential for high performance.

This thesis makes three contributions in the context of Multigrain Shared Memory. First, it provides the design of a multigrain shared memory system, called MGS, and demonstrates its feasibility and correctness via an implementation on a 32-processor Alewife machine. Second, this thesis undertakes an in-depth application study that quantifies the extent to which shared memory applications can leverage efficient shared memory mechanisms provided by DSSMPs. The thesis begins by looking at the performance of unmodified shared memory programs, and then investigates application transformations that improve performance. Finally, this thesis presents an approach called Synchronization Analysis for analyzing the performance of multigrain shared memory systems. The thesis develops a performance model based on Synchronization Analysis, and uses the model to study DSSMPs with up to 512 processors.

The experiments and analysis demonstrate that scalable DSSMPs can be constructed from small-scale workstation nodes to achieve competitive performance with large-scale all-hardware shared memory systems. For instance, the model predicts that a 256-processor DSSMP built from 16-processor parallel workstation nodes achieves equivalent performance to a 128-processor all-hardware multiprocessor on a communication-intensive workload.



“…Most current multicore architectures do not have this problem since they are not using multithreading (T_p = 1) for latency hiding but coherent caches to exploit access locality in programs (where available) or just try to tolerate natural latency defined by the distance of memory access making memory access nonuniform [17,22].…”

Section: Adding NUMA Support (mentioning)

confidence: 99%

NUMA Computing with Hardware and Software Co-Support on Configurable Emulated Shared Memory Architectures

Forsell

Hansson

Keßler

et al. 2014

IJNC


The emulated shared memory (ESM) architectures are good candidates for future general purpose parallel computers due to their ability to provide an easy-to-use explicitly parallel synchronous model of computation to programmers as well as avoid most performance bottlenecks present in current multicore architectures. In order to achieve full performance the applications must, however, have enough thread-level parallelism (TLP). To solve this problem, in our earlier work we have introduced a class of configurable emulated shared memory (CESM) machines that provides a special non-uniform memory access (NUMA) mode for situations where TLP is limited or for direct compatibility for legacy code sequential computing and NUMA mechanism. Unfortunately, the earlier proposed CESM architecture does not integrate the different modes of the architecture well together, e.g. by leaving the memories for different modes isolated, and therefore the programming interface is non-integrated. In this paper we propose a number of hardware and software techniques to support NUMA computing in CESM architectures in a seamless way. The hardware techniques include three different NUMA shared memory access mechanisms and the software ones provide a mechanism to integrate and optimize NUMA computation into the standard parallel random access machine (PRAM) operation of the CESM. The hardware techniques are evaluated on our REPLICA CESM architecture and compared to an ideal CESM machine making use of the proposed software techniques.


“…They propose a set of optimizations that can reach the performance of heavyweight hardware support. These optimizations include write-forwarding [1,19,20,26] at line boundaries, synchronization counters in L2 caches (which they do not describe), and small dedicated receive-side caches for pipelined streaming data in a separate address space. Our design integrates equivalent mechanisms inside general purpose caches, augmented with RDMA for efficient bulk transfers.…”

Section: Related Work and Contributions (mentioning)

confidence: 99%

Cache-Integrated Network Interfaces: Flexible On-Chip Communication and Synchronization for Large-Scale CMPs

Kavadias

Katevenis

Zampetakis

et al. 2011

Int J Parallel Prog


Per-core scratchpad memories (or local stores) allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces, appropriate for scalable multicores, that combine the best of two worlds - the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized network interface (NI) functions. This paper presents our architecture, which provides local and remote scratchpad access, to either individual words or multiword blocks through RDMA copy. Furthermore, we introduce event responses, as a technique that enables software configurable communication and synchronization primitives. We present three event response mechanisms that expose NI functionality to software, for multiword transfer initiation, completion notifications for software selected sets of arbitrary size transfers, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and measure the logic overhead over a cache-only design for basic NI functionality to be less than 20%. We also evaluate the on-chip communication performance on the prototype, as well as the performance of synchronization functions with simulation of CMPs with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks, and efficient exploitation of the hardware bandwidth.

