Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems 1994
DOI: 10.1145/195473.195569

The performance impact of flexibility in the Stanford FLASH multiprocessor

Abstract: A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford FLASH multiprocessor, flexibility is obtained by requiring all transactions in a node to pass through a programmable node controller, called MAGIC. In this paper, we evaluate the performance costs of flexibility by comparing the performance of FLASH to that of an …

Citations: cited by 87 publications (26 citation statements)
References: 8 publications
“…We also fix the access time of main memory DRAM at 140 ns (14 system cycles), resulting in a local read miss time of 190 ns, one system cycle faster than the SGI Origin 2000. Fixing the interface delays and the memory access time is realistic [11] and allows us to focus on the performance of the communication architecture and the effects of varying l, o, g and P.…”
Section: Framework and Methodology (mentioning)
confidence: 99%
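
The timing figures in the statement above imply a system cycle time, which can be checked with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the Origin 2000 figure is inferred from the phrase "one system cycle faster", not quoted from any source.

```python
# Figures quoted in the citation statement above.
DRAM_ACCESS_NS = 140       # main memory DRAM access time
DRAM_ACCESS_CYCLES = 14    # the same access time in system cycles
LOCAL_READ_MISS_NS = 190   # local read miss time


def system_cycle_ns(access_ns: float = DRAM_ACCESS_NS,
                    access_cycles: int = DRAM_ACCESS_CYCLES) -> float:
    """Length of one system cycle implied by the DRAM figures (hypothetical helper)."""
    return access_ns / access_cycles


def ns_to_cycles(duration_ns: float, cycle_ns: float) -> float:
    """Convert a duration in nanoseconds to system cycles (hypothetical helper)."""
    return duration_ns / cycle_ns


cycle = system_cycle_ns()                                # 140 ns / 14 cycles
miss_cycles = ns_to_cycles(LOCAL_READ_MISS_NS, cycle)    # local read miss in cycles

# Inferred, not quoted: "one system cycle faster" than the reference machine.
implied_reference_miss_ns = LOCAL_READ_MISS_NS + cycle

print(f"system cycle: {cycle} ns")
print(f"local read miss: {miss_cycles} system cycles")
print(f"implied reference local read miss: {implied_reference_miss_ns} ns")
```

Under these figures one system cycle is 10 ns, so the 190 ns local read miss corresponds to 19 system cycles.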
“…When the communication controller is simply generating a request into the network or receiving a reply from the network, it incurs occupancy o. When the communication controller is the home of a network request, it incurs occupancy 2o because it has to retrieve data from memory and/or manipulate coherence state information [11]. In this case, we assume the data memory access happens in parallel with the operation of the controller.…”
Section: Occupancy (mentioning)
confidence: 99%
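
The occupancy rule quoted above can be written down as a small model. This is a minimal sketch, assuming a simple remote read made of three controller events (requester sends the request, home services it, requester receives the reply); that event sequence and all names here are hypothetical illustrations, not taken from the citing paper.

```python
from enum import Enum


class ControllerEvent(Enum):
    """Controller activities named in the quoted statement (enum is hypothetical)."""
    SEND_REQUEST = "generate a request into the network"
    RECEIVE_REPLY = "receive a reply from the network"
    HOME_SERVICE = "act as home of a network request"


def occupancy(event: ControllerEvent, o: float) -> float:
    """Occupancy charged for one event: 2o at the home, o otherwise."""
    if event is ControllerEvent.HOME_SERVICE:
        return 2 * o
    return o


def remote_read_occupancy(o: float) -> float:
    """Total controller occupancy for the assumed three-event remote read."""
    events = [
        ControllerEvent.SEND_REQUEST,   # requester's controller
        ControllerEvent.HOME_SERVICE,   # home node's controller
        ControllerEvent.RECEIVE_REPLY,  # requester's controller again
    ]
    return sum(occupancy(event, o) for event in events)


# Example with an arbitrary occupancy of 5 cycles per basic event.
print(remote_read_occupancy(5))
```

The total is a sum of busy time across controllers, not an end-to-end latency: the statement notes that the data memory access overlaps with the controller's work at the home node.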
“…The idealized Simple COMA system requires one additional cycle per message, for a total of 301 processor cycles, or about 1.5 µs. For comparison, the Stanford FLASH designers report remote read miss latencies of 1.11 and 1.45 µs, depending on whether the data is dirty in the remote processor's cache [23]. Because these fundamental latencies dominate, Typhoon takes only 33 percent longer to satisfy the miss despite the cost of running software handlers.…”
Section: Microbenchmark (mentioning)
confidence: 99%
“…On the caching node, the final step ("fetch data, resume") includes seven bus cycles (28 processor cycles) to fetch the critical word and three processor cycles to forward the data to the CPU and complete the load. The idealized Simple COMA system requires one additional cycle per message, for a total of 301 processor cycles, or about 1.5 µs. For comparison, the FLASH designers report remote read miss latencies of 1.11 and 1.45 µs, depending on whether the data is dirty in the remote processor's cache [20]. We also timed this remote miss on our Typhoon-0 implementation. The results cannot be directly compared with the simulation because the current platform has slower processors (66 MHz rather than 200 MHz) and a much slower network (a Myricom Myrinet with the interface on the 25 MHz SBUS I/O bus).…”
Section: Micro-evaluation (mentioning)
confidence: 99%
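
The cycle counts in the two statements above can be cross-checked against the quoted times. The sketch below assumes the 301-cycle figure refers to the 200 MHz simulated processor mentioned in the second statement; that pairing is an assumption, and the helper names are hypothetical.

```python
def cycles_to_us(cycles: float, clock_mhz: float) -> float:
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_mhz


def processor_cycles_per_bus_cycle(processor_cycles: float, bus_cycles: float) -> float:
    """Ratio of processor cycles to bus cycles for the same interval."""
    return processor_cycles / bus_cycles


# Assumed: the simulated processor runs at 200 MHz (from the second statement).
SIMULATED_CLOCK_MHZ = 200

simple_coma_us = cycles_to_us(301, SIMULATED_CLOCK_MHZ)  # idealized Simple COMA miss
bus_ratio = processor_cycles_per_bus_cycle(28, 7)        # "seven bus cycles (28 processor cycles)"

print(f"301 cycles at {SIMULATED_CLOCK_MHZ} MHz = {simple_coma_us:.3f} us")
print(f"processor cycles per bus cycle = {bus_ratio}")
```

At 200 MHz, 301 cycles is about 1.5 µs, matching the quoted figure, and the bus runs at one quarter of the processor clock.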