37th International Symposium on Microarchitecture (MICRO-37'04)
DOI: 10.1109/micro.2004.9
Cache Refill/Access Decoupling for Vector Machines


Cited by 17 publications (11 citation statements)
References 26 publications
“…Our results in Table 8 concur with the insight into realizable memory bandwidth by Batten et al [6]. That is, it is the control and buffering overhead in the processor (reorder buffer entries, physical registers, ld-st queue entries, outstanding cache miss trackers and buffering cost in caches and in the interconnect, etc.…”
Section: Stream and Saxpy (supporting, confidence: 88%)
“…Batten et al have noted that not only the access latency of memory sub-systems but also their bandwidth is very important to improve the application performance [8]. They have proposed an inexpensive non-blocking cache memory for vector architectures to improve the bandwidth and reduce the access latency of memory sub-systems.…”
Section: Related Work (mentioning, confidence: 99%)
“…As a result, the vector architecture can potentially achieve high computing performance for MMAs. Modern vector architectures usually employ a multibanked cache memory in order to improve their data transfer performance [5]- [8]. The memory subsystem with multibanked cache memory can provide data to parallelized functional units at a sufficient transfer rate.…”
Section: Introduction (mentioning, confidence: 99%)
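The multibanked cache organization mentioned in the statement above can be sketched in a few lines. This is a minimal illustration only; the bank count, line size, and interleaving scheme are assumptions, not details taken from the cited papers.

```python
NUM_BANKS = 8    # assumed bank count (illustrative)
LINE_BYTES = 64  # assumed cache-line size (illustrative)

def bank_of(addr: int) -> int:
    """Map a byte address to a cache bank by interleaving on the line index."""
    return (addr // LINE_BYTES) % NUM_BANKS

# A unit-stride vector load touches consecutive cache lines, so
# successive lines fall in distinct banks and can be serviced in
# parallel, which is how the banked cache sustains the transfer rate
# needed by parallel functional units.
banks = [bank_of(i * LINE_BYTES) for i in range(NUM_BANKS)]
```

With line-index interleaving, the eight consecutive lines above each map to a different bank; a strided or conflicting access pattern would instead serialize on a subset of banks.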
“…A major job of the compiler is to manage the utilization of the SRF. By contrast, this is much less of a concern for the Scale compiler due to Scale's cached shared memory model and decoupled cache refills [7]. Additionally, the stream processing compiler performs a binary search to determine the best strip-size when strip mining, while this is not an issue for Scale.…”
Section: Related Work (mentioning, confidence: 99%)
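Strip mining, mentioned in the statement above, can be illustrated with a minimal SAXPY sketch. The vector length and function name here are assumptions chosen for illustration; they are not taken from the Scale or stream-processing compilers being discussed.

```python
VLEN = 64  # assumed maximum hardware vector length (illustrative)

def saxpy_strip_mined(a, x, y):
    """Compute y <- a*x + y over strips of at most VLEN elements.

    Each inner strip corresponds to one full-width vector operation;
    the final strip handles the remainder when len(x) % VLEN != 0.
    """
    n = len(x)
    for start in range(0, n, VLEN):
        end = min(start + VLEN, n)
        for i in range(start, end):  # one vector op in hardware
            y[i] += a * x[i]
    return y
```

The binary search described in the quote would tune a value analogous to `VLEN` per loop nest, whereas Scale's cached shared memory model makes that strip-size choice largely unnecessary.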