2007 IEEE International Parallel and Distributed Processing Symposium
DOI: 10.1109/ipdps.2007.370536

Load Miss Prediction - Exploiting Power Performance Trade-offs

Abstract: Modern CPUs operate at GHz frequencies, but memory-access latencies remain relatively large, on the order of hundreds of cycles. Deeper cache hierarchies with larger cache sizes can mask these latencies for codes with good data locality and reuse, such as structured dense matrix computations. However, cache hierarchies do not necessarily benefit sparse scientific computing codes, which tend to have limited data locality and reuse. We therefore propose a new memory architecture with a Load …


Cited by 8 publications (6 citation statements)
References 21 publications (26 reference statements)
“…We use SimpleScalar configured to accept PISA compiled programs to model a single-core processor (such as the one in BlueGene [18]), starting from a PowerPC440 embedded core. We use Wattch [2] to calculate the power consumption with extrapolations for 0.13 µm technology [11], [15], [16]. We also developed a DDR2-type memory performance and power simulator for use with our modified versions of SimpleScalar and Wattch.…”
Section: Methods (mentioning)
confidence: 99%
“…We discuss how memory optimizations that we have developed earlier [11], [15], [16] can affect the performance of tuned and un-tuned versions of sparse matrix-vector multiplication. We consider the use of such optimizations with power-saving modes of the hardware, such as Dynamic Voltage and Frequency Scaling (DVFS) [5], to improve performance at significantly lower power levels.…”
Section: Introduction (mentioning)
confidence: 99%
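The DVFS trade-off invoked in the statement above can be illustrated with a toy two-phase workload model: dynamic power scales roughly with V²·f, while memory-stall time does not shrink as the core clock rises, so memory-bound code can run at a lower frequency and voltage for much less energy at a modest runtime cost. This is a hypothetical sketch with made-up numbers, not the paper's model or results.

```python
def energy_joules(freq_ghz, volt, cpu_bound_s, mem_bound_s, k=1.0):
    """Energy = power * time for a simple two-phase workload model.

    cpu_bound_s: compute time at 1 GHz (scales inversely with frequency)
    mem_bound_s: memory-stall time (independent of core frequency)
    k: technology-dependent capacitance constant (arbitrary units)
    """
    runtime = cpu_bound_s / freq_ghz + mem_bound_s
    power = k * volt ** 2 * freq_ghz  # dynamic power ~ C * V^2 * f
    return power * runtime

# A memory-bound workload: 1 s of compute (at 1 GHz), 4 s of stalls.
e_fast = energy_joules(2.0, 1.2, cpu_bound_s=1.0, mem_bound_s=4.0)
e_slow = energy_joules(1.0, 0.9, cpu_bound_s=1.0, mem_bound_s=4.0)
print(e_slow < e_fast)  # halving f (and lowering V) saves energy here
```

In this example the slower setting stretches runtime from 4.5 s to 5.0 s but cuts modeled energy by roughly two thirds, which is the shape of trade-off the citing work exploits.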
“…Among works using parallel creation of requests is [23]. Although it does not use an inclusive cache, that work uses a predictor to decide whether or not to request a datum directly from main memory.…” [translated from Portuguese]
Section: Related Work (unclassified)
“…For example, data can be prefetched into dead blocks and, while replacing, first preference can be given to dead blocks. The energy overhead of CBTs (e.g., due to predictors) can be offset by using the dynamic voltage/frequency scaling (DVFS) technique [70].…”
Section: Adaptive Bypassing (mentioning)
confidence: 99%
“…Predictor organization: Many CBTs use predictors (e.g., dead block predictors) for storing metadata and making bypassing decisions. The predictors indexed by PC of memory instructions incur less overhead than those indexed by addresses [20,23,35,39,53,61,70,71].…”
Section: Probabilistic Bypassing (mentioning)
confidence: 99%
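The last statement describes PC-indexed bypass predictors: a small table of saturating counters indexed by the program counter of the load, which is cheaper than indexing by data address because far fewer distinct PCs exist. The sketch below is an illustrative minimal design; the table size, counter width, and thresholds are arbitrary choices, not taken from any cited predictor.

```python
class BypassPredictor:
    """Toy PC-indexed cache-bypass predictor using saturating counters."""

    def __init__(self, entries=256, bits=2):
        self.entries = entries
        self.max_count = (1 << bits) - 1
        # Initialize weakly toward caching (counter at the midpoint).
        self.table = [self.max_count // 2] * entries

    def _index(self, pc):
        # Hashing the PC needs far fewer bits than hashing a data address.
        return pc % self.entries

    def should_bypass(self, pc):
        """Predict bypass when the counter says blocks from this PC die unused."""
        return self.table[self._index(pc)] < self.max_count // 2

    def update(self, pc, block_was_reused):
        """Train on the observed outcome: reuse -> cache, dead -> bypass."""
        i = self._index(pc)
        if block_was_reused:
            self.table[i] = min(self.max_count, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

p = BypassPredictor()
for _ in range(3):
    p.update(0x400A10, block_was_reused=False)  # blocks from this PC are dead
print(p.should_bypass(0x400A10))  # True
```

A dead-block predictor in the same family would be trained on whether an evicted block was ever re-referenced; the counter mechanics are identical, only the training signal changes.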