The performance impact of flexibility in the Stanford FLASH multiprocessor

Heinrich, Mark; Horowitz, Mark; Gupta, Anoop; Rosenblum, Mendel; Hennessy, John L.; Kuskin, Jeffrey S.; Ofelt, David; Heinlein, John; Baxter, Joel; Singh, Jaswinder Pal; Simoni, Richard; Gharachorloo, Kourosh; Nakahira, D.

doi:10.1145/195473.195569

Cited by 87 publications

(26 citation statements)

References 8 publications


“…We also fix the access time of main memory DRAM at 140 ns (14 system cycles), resulting in a local read miss time of 190 ns, one system cycle faster than the SGI Origin 2000. Fixing the interface delays and the memory access time is realistic [11] and allows us to focus on the performance of the communication architecture and the effects of varying l, o, g, and P.…”

Section: Framework and Methodology (mentioning)

confidence: 99%

“…When the communication controller is simply generating a request into the network or receiving a reply from the network, it incurs occupancy o. When the communication controller is the home of a network request, it incurs occupancy 2o because it has to retrieve data from memory and/or manipulate coherence state information [11]. In this case, we assume the data memory access happens in parallel with the operation of the controller.…”

Section: Occupancy (mentioning)

confidence: 99%


Latency, occupancy, and bandwidth in DSM multiprocessors: a performance evaluation

Chaudhuri, Heinrich, Holt, et al. 2003. IEEE Trans. Comput.

Abstract: While the desire to use commodity parts in the communication architecture of a DSM multiprocessor offers advantages in cost and design time, the impact on application performance is unclear. We study this performance impact through detailed simulation, analytical modeling, and experiments on a flexible DSM prototype, using a range of parallel applications. We adapt the logP model to characterize the communication architectures of DSM machines. The l (network latency) and o (controller occupancy) parameters are the keys to performance in these machines, with the g (node-to-network bandwidth) parameter becoming important only for the fastest controllers. We show that, of all the logP parameters, controller occupancy has the greatest impact on application performance. Of the two contributions of occupancy to performance degradation (the latency it adds and the contention it induces), it is the contention component that governs performance regardless of network latency, showing a quadratic dependence on o. As expected, techniques to reduce the impact of latency make controller occupancy a greater bottleneck. Surprisingly, the performance impact of occupancy is substantial, even for highly-tuned applications and even in the absence of latency hiding techniques. Scaling the problem size is often used as a technique to overcome limitations in communication latency and bandwidth. Through experiments on a DSM prototype, we show that there are important classes of applications for which the performance lost by using higher occupancy controllers cannot be regained easily, if at all, by scaling the problem size.
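The occupancy rule in the excerpt quoted above (occupancy o when a controller only sends a request or receives a reply, 2o when it acts as the home of a request) can be illustrated with a short sketch. This is not code from the cited paper: the function names are hypothetical, and the three-visit remote read (requester sends, home services, requester receives) is an assumption made only for illustration.

```python
# Minimal sketch of the occupancy accounting described in the quoted excerpt.
# Hypothetical names; not taken from the cited paper.

def controller_occupancy(o: float, is_home: bool) -> float:
    """Occupancy charged for one controller visit: 2o at the home node, o otherwise."""
    return 2 * o if is_home else o


def remote_read_occupancy(o: float) -> float:
    """Total controller occupancy for an ASSUMED simple remote read.

    Assumption: three controller visits, namely the requester sending the
    request, the home node servicing it, and the requester receiving the reply.
    """
    visits_is_home = [False, True, False]
    return sum(controller_occupancy(o, is_home) for is_home in visits_is_home)
```

Under that assumption a remote read accumulates 4o of controller occupancy in total, which shows why the home node's 2o term is the largest single contribution.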


“…The idealized Simple COMA system requires one additional cycle per message, for a total of 301 processor cycles, or about 1.5 µs. For comparison, the Stanford FLASH designers report remote read miss latencies of 1.11 and 1.45 µs, depending on whether the data is dirty in the remote processor's cache [23]. Because these fundamental latencies dominate, Typhoon takes only 33 percent longer to satisfy the miss despite the cost of running software handlers.…”

Section: Microbenchmark (mentioning)

confidence: 99%

Hardware support for flexible distributed shared memory

Reinhardt, Pfile, Wood. 1998. IEEE Trans. Comput.

Abstract: Workstation-based parallel systems are attractive due to their low cost and competitive uniprocessor performance. However, supporting a cache-coherent global address space on these systems involves significant overheads. We examine two approaches to coping with these overheads. First, DSM-specific hardware can be added to the off-the-shelf component base to reduce overheads. Second, application-specific coherence protocols can avoid some overheads by exploiting programmer (or compiler) knowledge of an application's communication patterns. To explore the interaction between these approaches, we simulated four designs that add DSM acceleration hardware to a collection of off-the-shelf workstation nodes. Three of the designs support user-level software coherence protocols, enabling application-specific protocol optimizations. To verify the feasibility of our hardware approach, we constructed a prototype of the simplest design. Measured speedups from the prototype match simulation results closely. We find that, even with aggressive DSM hardware support, custom protocols can provide significant speedups for some applications. In addition, the custom protocols are generally effective at reducing the impact of other overheads, including those due to less aggressive hardware support and larger network latencies. However, for three of our benchmarks, the additional hardware acceleration provided by our most aggressive design avoids the need to develop more efficient custom protocols.


“…On the caching node, the final step ("fetch data, resume") includes seven bus cycles (28 processor cycles) to fetch the critical word and three processor cycles to forward the data to the CPU and complete the load. The idealized Simple COMA system requires one additional cycle per message, for a total of 301 processor cycles, or about 1.5 µs. For comparison, the FLASH designers report remote read miss latencies of 1.11 and 1.45 µs, depending on whether the data is dirty in the remote processor's cache [20]. We also timed this remote miss on our Typhoon-0 implementation. The results cannot be directly compared with the simulation because the current platform has slower processors (66 MHz rather than 200 MHz) and a much slower network (a Myricom Myrinet with the interface on the 25 MHz SBUS I/O bus).…”

Section: Micro-evaluation (mentioning)

confidence: 99%

Decoupled hardware support for distributed shared memory

Reinhardt, Pfile, Wood. 1996. SIGARCH Comput. Archit. News

This paper investigates hardware support for fine-grain distributed shared memory (DSM) in networks of workstations. To reduce design time and implementation cost relative to dedicated DSM systems, we decouple the functional hardware components of DSM support, allowing greater use of off-the-shelf devices. We present two decoupled systems, Typhoon-0 and Typhoon-1. Typhoon-0 uses an off-the-shelf protocol processor and network interface; a custom access control device is the only DSM-specific hardware. To demonstrate the feasibility and simplicity of this access control device, we designed and built an FPGA-based version in under one year. Typhoon-1 also uses an off-the-shelf protocol processor, but integrates the network interface and access control devices for higher performance. We compare the performance of the two decoupled systems with two integrated systems via simulation. For six benchmarks on 32 nodes, Typhoon-0 ranges from 30% to 309% slower than the best integrated system, while Typhoon-1 ranges from 13% to 132% slower. Four of the six benchmarks achieve speedups of 12 to 18 on Typhoon-0 and 15 to 26 on Typhoon-1, compared with 19 to 35 on the best integrated system. Two benchmarks are hampered by high communication overheads, but selectively replacing shared-memory operations with message passing provides speedups of at least 16 on both decoupled systems. These speedups indicate that decoupled designs can potentially provide a cost-effective alternative to complex high-end DSM systems.
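The cycle counts in the micro-evaluation excerpt above can be cross-checked with a small conversion sketch. The helper names are hypothetical, and the 200 MHz processor clock is read off the excerpt's "66 MHz rather than 200 MHz" remark; the bus ratio is inferred from "seven bus cycles (28 processor cycles)".

```python
# Minimal sketch: converting the quoted cycle counts to time.
# Hypothetical helper names; clock values are taken from the quoted excerpt.

SIM_PROCESSOR_MHZ = 200          # simulated processor clock, per the excerpt
PROC_CYCLES_PER_BUS_CYCLE = 4    # inferred: 28 processor cycles / 7 bus cycles


def cycles_to_us(cycles: int, clock_mhz: float) -> float:
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_mhz


def bus_to_processor_cycles(bus_cycles: int) -> int:
    """Convert bus cycles to processor cycles using the inferred 4:1 ratio."""
    return bus_cycles * PROC_CYCLES_PER_BUS_CYCLE
```

With these values, 301 processor cycles at 200 MHz come to roughly 1.5 µs, matching the figure in the excerpt, and seven bus cycles correspond to 28 processor cycles.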
