2005
DOI: 10.1109/mm.2005.49
High-Performance Throughput Computing

Cited by 70 publications (30 citation statements)
References 13 publications
“…This estimate is consistent with the current core-size differences in 45 nm for the out-of-order Penryn and the in-order Silverthorne, and is more conservative than the 5x area reduction reported by Asanovic et al. [3]. We then assume a multithreading area overhead of 10%, as reported in Chaudhry et al. [7]. Total die area for the processor and L1 die is estimated to be between 423 mm² (Penryn based) and 491 mm² (Silverthorne based).…”
Section: Cores (supporting)
confidence: 89%
“…Hardware Scouting, described by Chaudhry et al. [7], is an extension of runahead execution that includes several optimizations over previous runahead proposals. In hardware scouting, launching and exiting out of runahead is a zero-latency operation, and runahead mode is also entered on low-latency misses (L2 hits).…”
Section: Runahead Execution and Hardware Scout (mentioning)
confidence: 99%
“…As the memory wall problem has come to overshadow other aspects of processing, various forms of runahead execution have been proposed [21][12][7][3][4]. Runahead execution attempts to reduce the effect of long memory latencies by increasing the memory-level parallelism.…”
Section: Introduction (mentioning)
confidence: 99%
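The excerpt above notes that runahead execution hides long memory latencies by raising memory-level parallelism. A toy back-of-envelope model (the latency figure and the idealization that all independent misses overlap are my assumptions, not numbers from the paper) sketches why overlapping misses beats serializing them:

```python
# Toy model of runahead execution (illustration only; the latency value
# and the "all misses fully overlap" idealization are assumptions).

MISS_LATENCY = 100  # assumed main-memory latency in cycles

def blocking_cycles(independent_misses: int) -> int:
    """An in-order core that stalls on each miss serializes the latencies."""
    return independent_misses * MISS_LATENCY

def runahead_cycles(independent_misses: int) -> int:
    """Under runahead, the core keeps executing past the first miss and
    issues the remaining independent misses as prefetches, so their
    latencies overlap the first one (memory-level parallelism)."""
    if independent_misses == 0:
        return 0
    # First miss pays full latency; each overlapped miss costs ~1 issue cycle.
    return MISS_LATENCY + (independent_misses - 1)

print(blocking_cycles(4))  # 400 cycles when misses serialize
print(runahead_cycles(4))  # 103 cycles when they overlap
```

With four independent misses the blocking core pays roughly 4x the latency, while the idealized runahead core pays little more than one latency window, which is the memory-level-parallelism gain the citing papers describe.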
“…Techniques for reducing the frequency and impact of cache misses include hardware and software prefetching (Chen and Bauer 1994; Klaiber and Levy 1991), speculative loads and execution (Rogers et al. 1992), and multithreading (Agarwal 1992; Byrd and Holliday 1995; Ungerer et al. 2003; Chaudhry et al. 2005; Emer et al. 2007).…”
Section: Introduction (mentioning)
confidence: 99%