High-Performance DRAMs in Workstation Environments

Vinodh Cuppu, Student Member, IEEE, Bruce Jacob, Member, IEEE, Brian Davis, Member, IEEE, Trevor Mudge, Fellow, IEEE

Abstract - This paper presents a simulation-based performance study of several of the new high-performance DRAM architectures, each evaluated in a small system organization. These small-system organizations correspond to workstation-class c...

INTRODUCTION

In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. This paper presents a simulation-based performance study of a representative group, evaluating each in terms of its effect on total execution time. While there are a number of academic proposals for new DRAM designs, space limits us to covering only existing commercial architectures. To obtain accurate memory-request timing for an aggressive out-of-order processor, we integrate our code into the SimpleScalar tool set [4].

This paper presents a baseline study of a small-system DRAM organization: these are systems with only a handful of DRAM chips (0.1-1 GB). We do not consider large-system DRAM organizations with many gigabytes of storage that are highly interleaved. We also study a set of benchmarks that are appropriate for such systems: user-class applications such as compilers and small databases rather than server-class applications such as transaction processing systems.

The study asks and answers the following questions:

• What is the effect of improvements in DRAM technology on the memory latency and bandwidth problems? Contemporary techniques for improving processor performance and tolerating memory latency are exacerbating the memory bandwidth problem [5]. Our results show that current DRAM architectures are attacking exactly this problem: the most recent technologies (SDRAM, ESDRAM, DDR, and Rambus) have reduced the stall time due to limited bandwidth by a factor of three compared to earlier DRAM architectures. However, the memory-latency component of overhead has not improved.

• Where is time spent in the primary memory system (the memory system beyond the cache hierarchy, but not including secondary [disk] or tertiary [backup] storage)? What is the performance benefit of exploiting the page mode of contemporary DRAMs? For the newer DRAM designs, the time to extract the required data from the sense amps/row caches for transmission on the memory bus is the largest component in the average access time, though page mode allows this to be overlapped with column access and the time to transmit the data over the memory bus.

• How much locality is there in the address stream that reaches the primary memory system? The stream of addresses that miss the L2 cache contains a significant amount of locality, as measured by the hit rates in the DRAM row buffers. The hit rates for the applications studied range from 2% to 97%, with a mean hit rate of 40% for a 1MB L2 cache. (This does not include hits to the row buffers when making multiple DRAM requests to read one cache line.) A minimal model of this measurement appears in the sketch below.
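The following is a rough sketch of how such a row-buffer hit rate can be counted over an L2 miss stream, assuming a simple open-page policy and a row-interleaved bank mapping. It is not the authors' SimpleScalar-based setup; the bank count, row size, and address mapping are illustrative assumptions.

# Minimal sketch of measuring DRAM row-buffer hit rates over an L2 miss
# stream. The parameters (bank count, row size) and the address mapping
# are illustrative assumptions, not the configuration studied in the paper.

NUM_BANKS = 8          # assumed number of independent DRAM banks
ROW_SIZE  = 2 * 1024   # assumed row-buffer (page) size in bytes

def row_buffer_hit_rate(miss_addresses):
    """Open-page policy: each bank keeps its last-accessed row open."""
    open_row = {}                      # bank -> currently open row
    hits = 0
    for addr in miss_addresses:
        row  = addr // ROW_SIZE        # which DRAM row the address falls in
        bank = row % NUM_BANKS         # simple row-interleaved bank mapping
        if open_row.get(bank) == row:
            hits += 1                  # request served from the sense amps
        else:
            open_row[bank] = row       # row miss: activate the new row
    return hits / len(miss_addresses) if miss_addresses else 0.0

# Example: a sequential stream of cache-line misses has high row-buffer
# locality, landing near the top of the 2-97% range reported above.
stream = [i * 64 for i in range(1000)]
print(f"hit rate: {row_buffer_hit_rate(stream):.0%}")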
Today's digital signal processors (DSPs), unlike general-purpose processors, use a non-uniform addressing model in which the primary components of the memory system (the DRAM and dual tagless SRAMs) are referenced through completely separate segments of the address space. The recent trend of programming DSPs in high-level languages instead of assembly code has exposed this memory model as a potential weakness, as the model makes for a poor compiler target. In many of today's high-performance DSPs this non-uniform model is being replaced by a uniform model: a transparent organization like that of most general-purpose systems, in which all memory structures share the same address space as the DRAM system.

In such a memory organization, one must replace the DSP's tagless SRAMs with something resembling a general-purpose cache. This study investigates the performance of a range of traditional and slightly non-traditional cache organizations for a high-performance DSP, the Texas Instruments 'C6000 VLIW DSP. The traditional cache organizations range from a fraction of a kilobyte to several kilobytes; they approach the SRAM performance and, for some benchmarks, beat it. In the non-traditional cache organizations, rather than simply adding tags to the large on-chip SRAM structure, we take advantage of the relatively regular memory access behavior of most DSP applications and replace the tagless SRAM with a near-traditional cache that uses a very small number of wide blocks. This performs similarly to the traditional caches but uses less storage. In general, we find that one can achieve nearly the same performance as a tagless SRAM while using a much smaller footprint.
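As a rough illustration of the wide-block idea described above, the sketch below models a direct-mapped cache with only a handful of very wide lines. The block count, block size, and placement policy are assumptions chosen for illustration, not the organizations evaluated in this study.

# Minimal sketch of the "small number of wide blocks" idea: a
# near-traditional cache whose few very wide lines stand in for a
# tagless SRAM. All sizes here are illustrative assumptions.

NUM_BLOCKS = 4           # very few blocks...
BLOCK_SIZE = 2 * 1024    # ...each very wide (bytes)

class WideBlockCache:
    def __init__(self):
        self.tags = [None] * NUM_BLOCKS   # one tag per wide block

    def access(self, addr):
        block_addr = addr // BLOCK_SIZE
        index = block_addr % NUM_BLOCKS   # direct-mapped placement
        tag   = block_addr // NUM_BLOCKS
        if self.tags[index] == tag:
            return True                   # hit: data already on chip
        self.tags[index] = tag            # miss: fetch the whole wide block
        return False

# Regular, streaming DSP access patterns amortize the few tag checks
# over thousands of bytes per block, so tag storage stays tiny.
cache = WideBlockCache()
hits = sum(cache.access(i * 2) for i in range(4096))   # 16-bit data stream
print(f"hit rate: {hits / 4096:.1%}")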
Given a fixed CPU architecture and a fixed DRAM timing specification, there is still a large design space for a DRAM system organization. Parameters include the number of memory channels, the bandwidth of each channel, burst sizes, queue sizes and organizations, turnaround overhead, memory-controller page protocol, algorithms for assigning request priorities and scheduling requests dynamically, etc. In this design space, we see a wide variation in application execution times; for example, execution times for the SPEC CPU 2000 integer suite on a 2-way ganged Direct Rambus organization (32 data bits) with 64-byte bursts are 10-20% lower than execution times on an otherwise identical configuration that uses 32-byte bursts. These are two system configurations that are relatively close to each other in the design space; performance differences become even more pronounced for designs further apart. This paper characterizes the sources of overhead in high-performance DRAM systems and investigates the most effective ways to reduce a system's exposure to performance loss. In particular, we look at mechanisms to increase a system's support for concurrent transactions, mechanisms to reduce request latency, and mechanisms to reduce the "system overhead": the portion of the primary memory system's overhead that is not due to DRAM latency but rather to things like turnaround time, request queueing, inefficiencies due to read/write request interleaving, etc. Our simulator models a 2GHz, highly aggressive out-of-order uniprocessor. The interface to the memory system is fully non-blocking, supporting up to 32 outstanding misses at both the level-1 and level-2 caches and split-transaction busses to all DRAM banks.
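The burst-size effect can be made concrete with a deliberately simplified cost model, sketched below. This is not the paper's simulator; all constants (per-burst overhead, cycle time, bus widths) are made-up assumptions, chosen only to show how larger bursts amortize fixed per-burst costs such as turnaround and queueing.

# Back-of-the-envelope sketch of why burst size matters in the design
# space described above: each burst pays a fixed control/turnaround
# overhead, so larger bursts amortize it over more data. All numbers
# are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class DRAMConfig:
    channels: int        # independent memory channels
    channel_bytes: int   # data-bus width per channel, in bytes
    burst_bytes: int     # bytes transferred per burst
    overhead_ns: float   # assumed fixed per-burst cost (turnaround, queueing)
    cycle_ns: float      # assumed data-bus cycle time

def time_to_fetch(cfg: DRAMConfig, line_bytes: int) -> float:
    """Nanoseconds to fetch one cache line, ignoring bank conflicts."""
    bursts = -(-line_bytes // cfg.burst_bytes)   # ceiling division
    transfer = cfg.burst_bytes / (cfg.channels * cfg.channel_bytes) * cfg.cycle_ns
    return bursts * (cfg.overhead_ns + transfer)

# Two configurations that differ only in burst size, loosely mirroring
# the 64- vs 32-byte comparison above (2 ganged channels, 32 data bits).
for burst in (32, 64):
    cfg = DRAMConfig(channels=2, channel_bytes=2, burst_bytes=burst,
                     overhead_ns=10.0, cycle_ns=1.25)
    print(f"{burst}-byte bursts: {time_to_fetch(cfg, 64):.1f} ns per 64B line")

With these made-up constants the 64-byte configuration comes out about 25% faster, in the same spirit as (though not numerically matching) the 10-20% difference reported above.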