Jung Ho Ahn scite author profile

We expect that many-core microprocessors will push performance per chip from the 10 gigaflop to the 10 teraflop range in the coming decade. To support this increased performance, memory and inter-core bandwidths will also have to scale by orders of magnitude. Pin limitations, the energy cost of electrical signaling, and the non-scalability of chip-length global wires are significant bandwidth impediments. Recent developments in silicon nanophotonic technology have the potential to meet these off- and on-stack bandwidth requirements at acceptable power levels. Corona is a 3D many-core architecture that uses nanophotonic communication for both inter-core communication and off-stack communication to memory or I/O devices. Its peak floating-point performance is 10 teraflops. Dense wavelength division multiplexed optically connected memory modules provide 10 terabyte per second memory bandwidth. A photonic crossbar fully interconnects its 256 low-power multithreaded cores at 20 terabyte per second bandwidth. We have simulated a 1024-thread Corona system running synthetic benchmarks and scaled versions of the SPLASH-2 benchmark suite. We believe that in comparison with an electrically-connected many-core alternative that uses the same on-stack interconnect power, Corona can provide 2 to 6 times more performance on many memory-intensive workloads, while simultaneously reducing power.
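
The headline figures in the abstract above can be related with a little arithmetic. The following is a minimal sketch, not taken from the paper: the constant and function names are hypothetical, decimal prefixes (1 tera = 10^12) are assumed, and the threads-per-core figure assumes the simulated 1024-thread system uses all 256 cores.

```python
# Figures quoted in the abstract; decimal prefixes (1 tera = 1e12) are an assumption.
PEAK_FLOPS = 10e12      # 10 teraflops peak floating-point performance
MEMORY_BW = 10e12       # 10 terabytes per second of memory bandwidth
CROSSBAR_BW = 20e12     # 20 terabytes per second across the photonic crossbar
CORES = 256             # low-power multithreaded cores
THREADS = 1024          # thread count of the simulated system


def per_core(total, cores=CORES):
    """Evenly divide a chip-wide figure across the cores."""
    return total / cores


# Memory bytes available per peak floating-point operation.
bytes_per_flop = MEMORY_BW / PEAK_FLOPS

# Assumption: the simulated system spreads its threads over all 256 cores.
threads_per_core = THREADS // CORES

print(f"memory bytes per flop:  {bytes_per_flop}")
print(f"peak GFLOPS per core:   {per_core(PEAK_FLOPS) / 1e9}")
print(f"memory GB/s per core:   {per_core(MEMORY_BW) / 1e9}")
print(f"crossbar GB/s per core: {per_core(CROSSBAR_BW) / 1e9}")
print(f"threads per core:       {threads_per_core}")
```

Under these assumptions the design offers one byte of memory bandwidth per peak floating-point operation, which is the balance the abstract's bandwidth argument is aiming for.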

Programmable stream processors

Kapasi

et al. 2003

The complexity of modern media processing, including 3D graphics, image compression, and signal processing, requires tens to hundreds of billions of computations per second. To achieve these computation rates, current media processors use special-purpose architectures tailored to one specific application. Such processors require significant design effort and are thus difficult to change as media-processing applications and algorithms evolve.

The demand for flexibility in media processing motivates the use of programmable processors. However, very large-scale integration constraints limit the performance of traditional programmable architectures. In modern VLSI technology, computation is relatively cheap: thousands of arithmetic logic units that operate at multigigahertz rates can fit on a modestly sized 1-cm² die. The problem is that delivering instructions and data to those ALUs is prohibitively expensive. For example, only 6.5 percent of the Itanium 2 die is devoted to the 12 integer and two floating-point ALUs and their register files¹; communication, control, and storage overhead consume the remaining die area. In contrast, the more efficient communication and control structures of a special-purpose graphics chip, such as the Nvidia GeForce4, enable the use of many hundreds of floating-point and integer ALUs to render 3D images.

STREAM PROCESSING

In part, such special-purpose media processors are successful because media applications have abundant parallelism (enabling thousands of computations to occur in parallel) and require minimal global communication and storage (enabling data to pass directly from one ALU to the next). A stream architecture exploits this locality and concurrency by partitioning the communication and storage structures to support many ALUs efficiently:

• operands for arithmetic operations reside in local register files (LRFs) near the ALUs, in much the same way that special-purpose architectures store and communicate data locally;
• streams of data capture coarse-grained locality and are stored in a stream register file (SRF), which can efficiently transfer data to and from the LRFs between major computations; and
• global data is stored off-chip only when necessary.

These three explicit levels of storage form a data bandwidth hierarchy with the LRFs providing an order of magnitude more bandwidth than the SRF and the SRF providing an order of magnitude more bandwidth than off-chip storage. This bandwidth hierarchy is well matched to the characteristics of modern VLSI technology, as each level provides successively more storage and less bandwidth. By exploiting the locality inherent in media-processing applications, this hierarchy stores the data at the appropriate level, enabling hundreds of ALUs to operate at close to their peak rate.

Moreover, a stream architecture can support such a large number of ALUs in an area- and power-efficient manner. Modern high-performance micro ...

Stream processing promises to bridge the gap between inflexible special-purpose ...
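
The three-level bandwidth hierarchy described above can be illustrated with a toy throughput bound. This is a rough sketch under stated assumptions, not the authors' model: the bandwidth values read "an order of magnitude" as exactly 10x, the per-operation reference counts are invented for illustration, and the names `BANDWIDTH`, `per_level_limits`, and `bottleneck` are hypothetical.

```python
# Relative bandwidth of each storage level, in words per cycle.
# HYPOTHETICAL values: each level is assumed to offer exactly 10x the
# bandwidth of the level below it, a round reading of "an order of magnitude".
BANDWIDTH = {"lrf": 100.0, "srf": 10.0, "offchip": 1.0}


def per_level_limits(refs_per_op, bandwidth=BANDWIDTH):
    """Ops per cycle each level could sustain, given words referenced per ALU op."""
    return {
        level: bandwidth[level] / refs
        for level, refs in refs_per_op.items()
        if refs > 0
    }


def bottleneck(refs_per_op, bandwidth=BANDWIDTH):
    """Return (level, ops_per_cycle) for the level that limits throughput."""
    limits = per_level_limits(refs_per_op, bandwidth)
    level = min(limits, key=limits.get)
    return level, limits[level]


# Hypothetical kernel with good locality: most operands come from the LRFs.
local_kernel = {"lrf": 3.0, "srf": 0.25, "offchip": 0.02}
# Hypothetical kernel with poor locality: one off-chip word per operation.
global_kernel = {"lrf": 3.0, "srf": 0.25, "offchip": 1.0}

print(bottleneck(local_kernel))
print(bottleneck(global_kernel))
```

With these made-up numbers, the first kernel is limited by LRF bandwidth while the second is limited by off-chip bandwidth and sustains far fewer operations per cycle, which is the effect the text attributes to keeping data at the appropriate level.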

CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques

et al. 2011

This paper introduces CACTI-P, the first architecture-level integrated power, area, and timing modeling framework for SRAM-based structures with advanced leakage power reduction techniques. CACTI-P supports modeling of major leakage power reduction approaches including power-gating, long channel devices, and Hi-k metal gate devices. Because it accounts for implementation overheads, CACTI-P enables in-depth study of architecture-level tradeoffs for advanced leakage power management schemes. We illustrate the potential applicability of CACTI-P in the design and analysis of leakage power reduction techniques of future manycore processors by applying nanosecond scale power-gating to different levels of cache for a 64 core multithreaded architecture at the 22nm technology. Combining results from CACTI-P and a performance simulator, we find that although nanosecond scale power-gating is a powerful way to minimize leakage power for all levels of caches, its severe impacts on processor performance and energy when being used for L1 data caches make nanosecond scale power-gating a better fit for caches closer to main memory.

Future scaling of processor-memory interfaces

et al. 2009

Continuous evolution in process technology brings energy-efficiency and reliability challenges, which are harder for memory system designs since chip multiprocessors demand high bandwidth and capacity, global wires improve slowly, and more cells are susceptible to hard and soft errors. Recently, there are proposals aiming at better main-memory energy efficiency by dividing a memory rank into subsets. We holistically assess the effectiveness of rank subsetting in the context of system-wide performance, energy-efficiency, and reliability perspectives. We identify the impact of rank subsetting on memory power and processor performance analytically, then verify the analyses by simulating a chip-multiprocessor system using multithreaded and consolidated workloads. We extend the design of Multicore DIMM, one proposal embodying rank subsetting, for high-reliability systems and show that compared with conventional chipkill approaches, it can lead to much higher system-level energy efficiency and performance at the cost of additional DRAM devices.

CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory

Chen

Muralimanohar

et al. 2012

A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies

et al. 2008

In this paper we introduce CACTI-D, a significant enhancement of CACTI 5.0. CACTI-D adds support for modeling of commodity DRAM technology and support for main memory DRAM chip organization. CACTI-D enables modeling of the complete memory hierarchy with consistent models all the way from SRAM based L1 caches through main memory DRAMs on DIMMs. We illustrate the potential applicability of CACTI-D in the design and analysis of future memory hierarchies by carrying out a last level cache study for a multicore multithreaded architecture at the 32nm technology node. In this study we use CACTI-D to model all components of the memory hierarchy including L1, L2, last level SRAM, logic-process based DRAM or commodity DRAM L3 caches, and main memory DRAM chips. We carry out architectural simulation using benchmarks with large data sets and present results of their execution time, breakdown of power in the memory hierarchy, and system energy-delay product for the different system configurations. We find that commodity DRAM technology is most attractive for stacked last level caches, with significantly lower energy-delay products.
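
The energy-delay product used in the abstract above to compare system configurations is the product of energy consumed and execution time, with lower values being better. The sketch below only illustrates the metric; it does not use CACTI-D, the `Config` type and function names are hypothetical, and the three configurations and their numbers are invented placeholders rather than results from the paper.

```python
from typing import NamedTuple


class Config(NamedTuple):
    """One system configuration with its measured totals (hypothetical type)."""
    name: str
    energy_j: float      # total system energy, joules
    exec_time_s: float   # execution time, seconds


def energy_delay_product(config: Config) -> float:
    """Energy-delay product in joule-seconds; lower is better."""
    return config.energy_j * config.exec_time_s


def rank_by_edp(configs):
    """Sort configurations from best (lowest EDP) to worst."""
    return sorted(configs, key=energy_delay_product)


# Placeholder numbers for illustration only; not data from the paper.
configs = [
    Config("config_a", energy_j=10.0, exec_time_s=2.0),
    Config("config_b", energy_j=8.0, exec_time_s=3.0),
    Config("config_c", energy_j=12.0, exec_time_s=1.5),
]

for c in rank_by_edp(configs):
    print(c.name, energy_delay_product(c))
```

Because the metric multiplies the two quantities, a configuration that spends somewhat more energy can still rank first if it finishes enough sooner, as `config_c` does in this made-up example.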

Palbociclib plus exemestane with gonadotropin-releasing hormone agonist versus capecitabine in premenopausal women with hormone receptor-positive, HER2-negative metastatic breast cancer (KCSG-BR15-10): a multicentre, open-label, randomised, phase 2 trial

Park¹,

Kim²,

Kim³

et al. 2019

The Lancet Oncology

McPAT

Ahn

Strong

et al. 2009

1,863


Jung Ho Ahn

Corona: System Implications of Emerging Nanophotonic Technology
