Specialized dynamic optimizations for high-performance energy-efficient microarchitecture

Almog, Y.; Rosner, R.; Schwartz, N.; Schmorak, A.

doi:10.1109/cgo.2004.1281670

Cited by 12 publications (13 citation statements)

References 26 publications

“…As such, we designed Federation without further plans for horizontally aggregating more than two cores into a single very wide core. For higher single-thread performance, the combination of Federation with techniques that can effectively shorten the critical path, such as runahead execution [Mutlu et al. 2003], sophisticated prefetchers [Ganusov and Burtscher 2006], or dynamic optimization [Almog et al. 2004], seems to be the most fruitful path to pursue. An advantage of many such techniques is their toleration of infrequent or long latency communication with the main core, which makes it much easier to implement them using multiple cores of a manycore processor.…”

Section: Discussion (mentioning)

confidence: 99%

Federation. Boyer, Tarjan, Skadron. 2010. ACM Trans. Archit. Code Optim.

Manycore architectures designed for parallel workloads are likely to use simple, highly multithreaded, in-order cores. This maximizes throughput, but only with enough threads to keep hardware utilized. For applications or phases with more limited parallelism, we describe creating an out-of-order processor on-the-fly, by federating two neighboring in-order cores. We reuse the large register file in the multithreaded cores to implement some out-of-order structures and reengineer other large, associative structures into simpler lookup tables. The resulting federated core provides twice the single-thread performance of the underlying in-order core, allowing the architecture to efficiently support a wider range of parallelism.

ACM Reference Format: Boyer, M., Tarjan, D., and Skadron, K. 2010. Federation: Boosting per-thread performance of throughput-oriented manycore architectures. ACM Trans.

“…The recent rePlay [12], [13], [14] and PARROT [15], [16] frameworks enable very aggressive hardware optimizations, by using a dynamically configurable optimization engine running in parallel with a high performance execution core. The key idea in these frameworks is the atomic execution of traces.…”

Section: A Trace Cache Optimizations (mentioning)

confidence: 99%

Dynamic Code Value Specialization Using the Trace Cache Fill Unit. Zhang, Checkoway, Calder, et al. 2006. 2006 International Conference on Computer Design.

Abstract: Value specialization is a technique which can improve a program's performance when its code frequently takes the same values. In this paper, speculative value specialization is applied dynamically by utilizing the trace cache hardware. We implement a small, efficient hardware profiler to identify loads that have semi-invariant runtime values. A specialization engine off the program's critical path generates highly optimized traces using these values, which reside in the trace cache. Specialized traces are dynamically verified during execution, and mis-specialization is recovered automatically without new hardware overhead. Our simulation shows that dynamic value specialization in the trace cache achieves a 17% speedup, even over a system with support for hardware value prediction. When combined with other techniques aimed at tolerating memory latencies, this technique still performs well: combined with an aggressive hardware prefetcher, it achieves 24% better performance than prefetching alone.

“…Architecture research has produced a wide variety of microarchitectural, predictor-based optimizations, including value prediction [16,17,24], instruction reuse [26], hardware prefetching [4,7,13,14,18,25,31,27,32], dynamic program optimization [1,19,23,35], pointer caching [8], and cache replacement policies, e.g., [20]. These techniques collect metadata information at runtime about the application's behavior and store it in on-chip buffers or lookup tables.…”

Section: Introduction (mentioning)

confidence: 99%

Predictor virtualization. Burcea, Somogyi, Moshovos, et al. 2008. SIGOPS Oper. Syst. Rev.

Many hardware optimizations rely on collecting information about program behavior at runtime. This information is stored in lookup tables. To be accurate and effective, these optimizations usually require large dedicated on-chip tables. Although technology advances offer an increased amount of on-chip resources, these resources are allocated to increase the size of on-chip conventional cache hierarchies. This work proposes Predictor Virtualization, a technique that uses the existing memory hierarchy to emulate large predictor tables. We demonstrate the benefits of this technique by virtualizing a state-of-the-art data prefetcher. Full-system, cycle-accurate simulations demonstrate that the virtualized prefetcher preserves the performance benefits of the original design, while reducing the on-chip storage dedicated to the predictor table from 60KB down to less than one kilobyte.

Abstract: We study several
