An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth

Woo, Dong Hyuk; Seong, Nak Hee; Lewis, Dean L.; Lee, Hsien-Hsin S.

doi:10.1109/hpca.2010.5416628

Cited by 190 publications

(100 citation statements)

References 39 publications


“…A different approach is followed by Loh, that in [9] considers 3D-DRAM stacked on top of multi-processors and revises the memory system organization in a 3D context. More recently, also Woo et al [10], have explored a memory architecture that exploits TSVs for connecting the last level cache to the 3D stacked DRAM. The work of Madan et al [11] instead, takes in consideration a 3D system composed by a DRAM layer and an SRAM cache banks layer on top of a processing layer.…”

Section: Related Work (mentioning)

confidence: 99%

3D-LIN: A configurable low-latency interconnect for multi-core clusters with 3D stacked L1 memory

Beanato, Loi, Micheli, et al. 2012

2012 IEEE/IFIP 20th International Conference on VLSI and System-on-Chip (VLSI-SoC)


Abstract: Shared L1 memories are of interest for tightly coupled processor clusters in programmable accelerators as they provide a convenient shared memory abstraction while avoiding cache coherence overheads. The performance of a shared-L1 memory critically depends on the architecture of the low-latency interconnect between processors and memory banks, which needs to provide ultra-fast access to the largest possible L1 working set. The advent of 3D technology provides new opportunities to improve the interconnect delay and the form factor. In this paper we propose a network architecture, 3D-LIN, based on 3D integration technology. The network can be configured based on user specifications and technology constraints to provide fast access to L1 memories on multiple stacked dies. The extracted results from the physical synthesis of 3D-LIN permit to explore trade-offs between memory size and network latency from a planar design to multiple memory layers stacked on top of logic. In the case where the system memory requirements lead to a memory area that occupies 60% of the chip, the form factor can be reduced by more than 60% by stacking 2 memory layers on the logic. Latency reduction is also promising: the network itself, configured for connecting 16 processing elements to 128 memory banks on 2 memory layers, is 24% faster than the planar system.


“…We assume a 3D-stacked DRAM cache that leverages high TSV (through silicon via) bandwidth [27]. We present a latency-power tradeoff with mixed SRAM and DRAM caches (SRAM/DRAM cache + PCRAM main memory).…”

Section: Cache Hierarchy With Heterogeneous Technologies (mentioning)

confidence: 99%

Exploring latency-power tradeoffs in deep nonvolatile memory hierarchies

Yoon, Gonzalez, Ranganathan, et al. 2012

Proceedings of the 9th Conference on Computing Frontiers


To handle the demand for very large main memory, we are likely to use nonvolatile memory (NVM) as main memory. NVM main memory will have higher latency than DRAM. To cope with this, we advocate a less-deep cache hierarchy based on a large last-level, NVM cache. We develop a model that estimates average memory access time and power of a cache hierarchy. The model is based on captured application behavior, an analytical power and performance model, and circuit-level memory models such as CACTI and NVSim. We use the model to explore the cache hierarchy design space and present latency-power tradeoffs for memory intensive SPEC benchmarks and scientific applications. The results indicate that a flattened hierarchy lowers power and improves average memory access time.
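The average memory access time mentioned in this abstract is conventionally computed level by level: each level contributes its hit time plus its local miss rate times the cost of going one level further down. The sketch below illustrates only that textbook recurrence; it is not the model from the cited paper, and the `Level` class, the `amat` function, and every latency and miss-rate value are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Level:
    """One cache level (hypothetical illustration, not from the cited paper)."""
    name: str
    hit_time_ns: float   # time to service a hit at this level
    miss_rate: float     # local miss rate: misses / accesses reaching this level


def amat(levels: List[Level], memory_latency_ns: float) -> float:
    """Average memory access time via the standard recurrence.

    Working from the last-level cache upward:
        cost(level) = hit_time + miss_rate * cost(next level down)
    with main memory latency as the base case.
    """
    cost = memory_latency_ns
    for level in reversed(levels):
        cost = level.hit_time_ns + level.miss_rate * cost
    return cost


# Placeholder numbers, assumed purely for the example.
hierarchy = [
    Level("L1", hit_time_ns=1.0, miss_rate=0.10),
    Level("L2", hit_time_ns=5.0, miss_rate=0.50),
    Level("L3", hit_time_ns=20.0, miss_rate=0.50),
]

if __name__ == "__main__":
    print(f"AMAT: {amat(hierarchy, memory_latency_ns=200.0):.2f} ns")
```

Removing or resizing a level in `hierarchy` and re-running shows how the recurrence trades hit time against miss rate, which is the kind of design-space question the abstract describes.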


“…Other CMPs have been designed in later years exploiting multiple 3-D-DRAM layers [16], [17]; these solutions showed the possibility to reorganize modules and interconnections in order to have a significant bandwidth increase, resulting in a relevant speedup in the routine execution. Loh's [18] solution demonstrated an achievable speed-up of 280% with respect to the baseline CMP (an Intel QuadCore) connected to off-chip DRAM.…”

Section: Related Work (mentioning)

confidence: 99%

Design and Testing Strategies for Modular 3-D-Multiprocessor Systems Using Die-Level Through Silicon Via Technology

Beanato, Giovannini, Cevrero, et al. 2012

IEEE J. Emerg. Sel. Topics Circuits Syst.

