Proceedings. Second International Symposium on High-Performance Computer Architecture
DOI: 10.1109/hpca.1996.501191

Distributed prefetch-buffer/cache design for high performance memory systems

Abstract: Microprocessor execution speeds are improving at a rate of 50%-80% per year while DRAM access times are improving at a much lower rate of 5%-10% per year. Computer systems are rapidly approaching the point at which overall system performance is determined not by the speed of the CPU but by the speed of the memory system. We present a high performance memory system architecture that overcomes the growing speed disparity between high performance microprocessors and current generation DRAMs. A novel prediction and…

Cited by 45 publications (56 citation statements)
References 14 publications

“…Although these traces are from the obsolete SPEC92 benchmarks, they are sufficient to warm up the size of cache used here, because 1.1 billion references are used, with traces interleaved to create the effect of a multiprogramming workload. (The traces used in this paper can be found at ftp://tracebase.nmsu.edu/pub/traces/uni/r2000/.) Traces of context-switching code and TLB management code are interleaved as appropriate.…”
Section: Inputs and Variations
confidence: 99%
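
The interleaving this statement describes can be sketched concretely. The hypothetical C program below feeds a cache simulator fixed-length quanta of references from several single-program trace files, round-robin, to emulate a multiprogramming workload; the file names, the "type address" record format, and the quantum length are all assumptions for illustration, not details from the cited work.

```c
/* Hypothetical sketch of trace interleaving for a multiprogramming
 * workload: emit a fixed quantum of references from each trace in turn.
 * File names, record format, and QUANTUM are assumptions. */
#include <stdio.h>

#define NTRACES 2
#define QUANTUM 100000   /* references per simulated time slice (assumed) */

int main(void)
{
    const char *names[NTRACES] = { "trace0.din", "trace1.din" };  /* hypothetical */
    FILE *f[NTRACES];
    int live = NTRACES;

    for (int i = 0; i < NTRACES; i++) {
        f[i] = fopen(names[i], "r");
        if (!f[i]) live--;                    /* skip traces that fail to open */
    }

    while (live > 0) {
        for (int i = 0; i < NTRACES; i++) {   /* one quantum from each trace */
            unsigned long addr;
            int type, k;
            if (!f[i]) continue;
            for (k = 0; k < QUANTUM; k++) {
                if (fscanf(f[i], "%d %lx", &type, &addr) != 2) {
                    fclose(f[i]); f[i] = NULL; live--;   /* trace exhausted */
                    break;
                }
                printf("%d %lx\n", type, addr);          /* feed the simulator */
            }
        }
    }
    return 0;
}
```
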
“…Prefetching requires loading a cache block before it is requested, either by hardware [7,20] or with compiler support [28]; predictive prefetching attempts to improve prefetch accuracy for relatively varied memory access patterns [1]. In critical word first, the word containing the reference that caused the miss is fetched first, followed by the rest of the block [13].…”
Section: Alternatives
confidence: 99%
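
To make the critical-word-first ordering concrete, here is a small C illustration (a sketch of the general technique, not the specific mechanism of [13]): on a miss, the word that caused the miss is delivered first, and the remaining words of the block follow in wrap-around order. The 8-word block size is an assumption.

```c
/* Critical-word-first fill order: the missed word arrives first,
 * then the rest of the block wraps around. */
#include <stdio.h>

#define BLOCK_WORDS 8   /* assumed: 8 words per cache block */

/* Print the order in which words arrive for a miss on word `crit`. */
static void fill_order(unsigned crit)
{
    for (unsigned i = 0; i < BLOCK_WORDS; i++)
        printf("%u ", (crit + i) % BLOCK_WORDS);
    printf("\n");
}

int main(void)
{
    fill_order(5);   /* miss on word 5 -> 5 6 7 0 1 2 3 4 */
    return 0;
}
```
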
“…Alexander and Kedem [1] describe a memory-based prefetching scheme that can significantly improve the performance of some applications. They use a prediction table to store up to four possible "next-access" predictions for any given memory address.…”
Section: Related Work
confidence: 99%
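
A hedged sketch of that prediction-table idea: each entry remembers up to four addresses observed to follow a given address, and every reference both trains the table and returns prefetch candidates. The table size, direct-mapped indexing, and rotating slot replacement below are illustrative assumptions; the paper's actual organization may differ.

```c
/* Sketch of a next-access prediction table: up to SLOTS successor
 * addresses are recorded per address.  Sizing and replacement are
 * assumptions for illustration. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SLOTS   4        /* up to four "next-access" predictions */
#define ENTRIES 1024     /* assumed table size */

struct entry {
    uint64_t tag;            /* address this entry predicts for */
    uint64_t next[SLOTS];    /* candidate successor addresses */
    int      mru;            /* slot to overwrite next (simple rotation) */
};

static struct entry table[ENTRIES];
static uint64_t last_addr;   /* previous reference, used for training */

static struct entry *lookup(uint64_t addr)
{
    struct entry *e = &table[addr % ENTRIES];   /* direct-mapped (assumed) */
    if (e->tag != addr) { memset(e, 0, sizeof *e); e->tag = addr; }
    return e;
}

/* Train on the observed pair (last_addr -> addr), then return the
 * predictions recorded for addr so the caller can issue prefetches. */
static const uint64_t *reference(uint64_t addr)
{
    struct entry *e = lookup(last_addr);
    int i;
    for (i = 0; i < SLOTS; i++)          /* successor already recorded? */
        if (e->next[i] == addr) break;
    if (i == SLOTS) {                    /* no: replace a slot */
        e->next[e->mru] = addr;
        e->mru = (e->mru + 1) % SLOTS;
    }
    last_addr = addr;
    return lookup(addr)->next;
}

int main(void)
{
    uint64_t trace[] = { 0x100, 0x200, 0x100, 0x200, 0x100 };
    for (int t = 0; t < 5; t++) {
        const uint64_t *p = reference(trace[t]);
        printf("after %#lx predict %#lx\n",
               (unsigned long)trace[t], (unsigned long)p[0]);
    }
    return 0;
}
```

A caller would invoke `reference()` on every memory access and issue prefetches for the nonzero candidates it returns; the alternating trace above learns to predict 0x200 after 0x100.
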
“…Software prefetching [Callahan et al 1991] [Porterfield 1989] [Klaiber and Levy 1991] [Mowry et al 1992] exploits compile-time information to insert prefetch instructions in a program. Correlation-based prefetching [Joseph and Grunwald 1997] [Alexander and Kedem 1996] also relies on address history to predict future references, but it can capture complex access patterns. Prediction accuracy depends on the size of the prediction table and on the stability of the access patterns.…”
Section: Related Work
confidence: 99%
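
The software-prefetching approach mentioned first can be illustrated with the GCC/Clang `__builtin_prefetch` intrinsic, which compiler-inserted prefetches commonly lower to; the prefetch distance of 16 elements is an assumed tuning parameter, not a value from the cited works.

```c
/* Software prefetching sketch: prefetch DIST elements ahead of the
 * current load, as a compiler inserting prefetch instructions might. */
#include <stdio.h>

#define DIST 16   /* assumed prefetch distance, in elements */

static long sum(const long *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)   /* don't prefetch past the array */
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1 /* low locality */);
        s += a[i];
    }
    return s;
}

int main(void)
{
    long a[1024];
    for (int i = 0; i < 1024; i++) a[i] = i;
    printf("%ld\n", sum(a, 1024));   /* prints 523776 */
    return 0;
}
```
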