André Seznec scite author profile

Dedicating more silicon area to single thread performance will necessarily be considered as worthwhile in future (potentially heterogeneous) multicores. In particular, Value Prediction (VP) was proposed in the mid 90's to enhance the performance of high-end uniprocessors by breaking true data dependencies. In this paper, we reconsider the concept of Value Prediction in the contemporary context and show its potential as a direction to improve current single thread performance. First, building on top of research carried out during the previous decade on confidence estimation, we show that every value predictor is amenable to very high prediction accuracy using very simple hardware. This clears the path to an implementation of VP without a complex selective reissue mechanism to absorb mispredictions. Prediction is performed in the in-order pipeline front-end and validation is performed in the in-order pipeline back-end, while the out-of-order engine is only marginally modified. Second, when predicting back-to-back occurrences of the same instruction, previous context-based value predictors relying on local value history exhibit a complex critical loop that should ideally be implemented in a single cycle. To bypass this requirement, we introduce a new value predictor, VTAGE, harnessing the global branch history. VTAGE can seamlessly predict back-to-back occurrences, allowing predictions to span over several cycles. It achieves higher performance than previously proposed context-based predictors. Specifically, using SPEC'00 and SPEC'06 benchmarks, our simulations show that combining VTAGE and a stride-based predictor yields up to 65% speedup on a fairly aggressive pipeline without support for selective reissue.
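To make the confidence-gating idea in the abstract above concrete, here is a minimal Python sketch. It is not VTAGE and not the paper's hardware: it models a simple per-instruction stride predictor whose prediction is used only once a saturating confidence counter reaches its ceiling. The class name `StridePredictor`, the helper `run`, the ceiling `conf_max`, and the example program counter are all assumptions made for illustration.

```python
class StridePredictor:
    """Illustrative stride value predictor gated by a saturating confidence counter."""

    def __init__(self, conf_max=7):
        self.conf_max = conf_max   # hypothetical counter ceiling, not from the paper
        self.table = {}            # pc -> [last_value, stride, confidence]

    def predict(self, pc):
        """Return a predicted value, or None while confidence is not saturated."""
        entry = self.table.get(pc)
        if entry is None or entry[2] < self.conf_max:
            return None
        return entry[0] + entry[1]

    def update(self, pc, actual):
        """Train with the committed result of the instruction at pc."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [actual, 0, 0]
            return
        last, stride, conf = entry
        if last + stride == actual:
            conf = min(conf + 1, self.conf_max)   # correct: raise confidence
        else:
            stride = actual - last                # wrong: relearn stride
            conf = 0                              # and reset confidence
        self.table[pc] = [actual, stride, conf]


def run(values, pc=0x400, conf_max=3):
    """Feed a value stream for one instruction; count used and correct predictions."""
    pred = StridePredictor(conf_max)
    used = correct = 0
    for v in values:
        p = pred.predict(pc)
        if p is not None:
            used += 1
            correct += (p == v)
        pred.update(pc, v)
    return used, correct
```

On a steadily strided stream, this sketch stays silent while confidence builds and then predicts every later occurrence. A single break in the pattern resets confidence, so predictions are only used when they are very likely to be right; that trade of coverage for accuracy is what the abstract relies on to avoid a complex recovery mechanism.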

Zero-content augmented caches

Dusser

Piquet

Seznec

2009

It has been observed that some applications manipulate large amounts of null data. Moreover, these zero data often exhibit high spatial locality. On some applications more than 20% of the data accesses concern null data blocks. Representing a null block in a cache on a standard cache line appears as a waste of resources. In this paper, we propose the Zero-Content Augmented cache, the ZCA cache. A ZCA cache consists of a conventional cache augmented with a specialized cache for memorizing null blocks, the Zero-Content cache or ZC cache. In the ZC cache, the data block is represented by its address tag and a validity bit. Moreover, as null blocks generally exhibit high spatial locality, several null blocks can be associated with a single address tag in the ZC cache. For instance, a ZC cache mapping 32MB of zero 64-byte lines uses less than 80KB of storage. Decompression of a null block is very simple, therefore read access time on the ZCA cache is in the same range as that of a conventional cache. On applications manipulating large amounts of null data blocks, such a ZC cache significantly reduces the miss rate and memory traffic, and therefore increases performance for a small hardware overhead. In particular, the write-back traffic on null blocks is limited. For applications with a low null block rate, no performance loss is observed.
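The ZC cache organization described above (one address tag shared by several null blocks, one validity bit per block) can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the paper's design: the class `ZeroContentCache`, the function `zc_storage_bytes`, the 128 lines per tag, and the 24-bit tag width are hypothetical choices; only the 64-byte line size and the 32MB mapped range come from the abstract.

```python
class ZeroContentCache:
    """Illustrative model: one tag covers a group of lines, one valid bit per null line."""

    def __init__(self, line_bytes=64, lines_per_tag=128):
        self.line_bytes = line_bytes
        self.lines_per_tag = lines_per_tag   # hypothetical grouping factor
        self.entries = {}                    # tag -> bitmask of known-null lines

    def _locate(self, addr):
        line = addr // self.line_bytes
        return line // self.lines_per_tag, line % self.lines_per_tag

    def mark_null(self, addr):
        """Record that the line containing addr holds only zeros."""
        tag, idx = self._locate(addr)
        self.entries[tag] = self.entries.get(tag, 0) | (1 << idx)

    def invalidate(self, addr):
        """Forget the line, e.g. after a non-zero write to it."""
        tag, idx = self._locate(addr)
        mask = self.entries.get(tag, 0) & ~(1 << idx)
        if mask:
            self.entries[tag] = mask
        else:
            self.entries.pop(tag, None)

    def read(self, addr):
        """Return a zero-filled line on a hit, or None on a miss."""
        tag, idx = self._locate(addr)
        if (self.entries.get(tag, 0) >> idx) & 1:
            return bytes(self.line_bytes)    # "decompression" is just producing zeros
        return None


def zc_storage_bytes(mapped_bytes, line_bytes=64, lines_per_tag=128, tag_bits=24):
    """Estimate storage: one valid bit per line plus one tag per group of lines."""
    lines = mapped_bytes // line_bytes
    groups = lines // lines_per_tag
    return (lines + groups * tag_bits) // 8
```

Under these assumed parameters, `zc_storage_bytes(32 * 1024 * 1024)` gives 77824 bytes (76KB), the same order as the "less than 80KB" figure quoted in the abstract. Most of that is the 64KB of validity bits, which shows why sharing one tag across many null blocks keeps the overhead small.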

Decoupled sectored caches: conciliating low tag implementation cost and low miss ratio

Seznec

Choosing Representative Slices of Program Execution for Microarchitecture Simulations: A Preliminary Application to the Data Stream

Lafage

Seznec

2001

Skewed Compressed Caches

Sardashti

Seznec

Wood

2014

Cache compression seeks the benefits of a larger cache with the area and power of a smaller cache. Ideally, a compressed cache increases effective capacity by tightly compacting compressed blocks, has low tag and metadata overheads, and allows fast lookups. Previous compressed cache designs, however, fail to achieve all these goals. In this paper, we propose the Skewed Compressed Cache (SCC), a new hardware compressed cache that lowers overheads and increases performance. SCC tracks superblocks to reduce tag overhead, compacts blocks into a variable number of sub-blocks to reduce internal fragmentation, but retains a direct tag-data mapping to find blocks quickly and eliminate extra metadata (i.e., no backward pointers). SCC does this using novel sparse super-block tags and a skewed associative mapping that takes compressed size into account. In our experiments, SCC provides on average 8% (up to 22%) higher performance, and on average 6% (up to 20%) lower total energy, achieving the benefits of the recent Decoupled Compressed Cache [26] with a factor of 4 lower area overhead and lower design complexity.

Tarantula: a vector extension to the alpha architecture

Espasa

Ardanaz

Emer

et al.

Tarantula is an aggressive floating point machine targeted at technical, scientific and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor [6,5]. Tarantula adds to the EV8 core a vector unit capable of 32 double-precision flops per cycle. The vector unit fetches data directly from a 16 MByte second level cache with a peak bandwidth of sixty four 64-bit values per cycle. The whole chip is backed by a memory controller capable of delivering over 64 GBytes/s of raw bandwidth. Tarantula extends the Alpha ISA with new vector instructions that operate on new architectural state. Salient features of the architecture and implementation are: (1) it fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, (2) provides high bandwidth for non-unit stride memory accesses, (3) supports gather/scatter instructions efficiently, (4) fully integrates with the EV8 core with a narrow, streamlined interface, rather than acting as a co-processor, (5) can achieve a peak of 104 operations per cycle, and (6) achieves excellent "real-computation" per transistor and per watt ratios. Our detailed simulations show that Tarantula achieves an average speedup of 5X over EV8, out of a peak speedup in terms of flops of 8X. Furthermore, performance on gather/scatter intensive benchmarks such as Radix Sort is also remarkable: a speedup of almost 3X over EV8 and 15 sustained operations per cycle. Several benchmarks exceed 20 operations per cycle.

Data-flow prescheduling for large instruction windows in out-of-order processors

Michaud

Seznec

André Seznec

Design tradeoffs for the alpha EV8 conditional branch predictor

Practical data value speculation for future high-end processors

Zero-content augmented caches

Decoupled sectored caches: conciliating low tag implementation cost and low miss ratio

Choosing Representative Slices of Program Execution for Microarchitecture Simulations: A Preliminary Application to the Data Stream

Skewed Compressed Caches

Tarantula: a vector extension to the alpha architecture

Data-flow prescheduling for large instruction windows in out-of-order processors
