Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040)
DOI: 10.1109/iccd.1999.808570
Energy and performance improvements in microprocessor design using a loop cache

Cited by 55 publications (47 citation statements)
References 3 publications
“…Also, Weiyu Tang et al. [11] introduce a Decoder Filter Cache into the instruction memory organization to reduce use of the instruction fetch and decode logic by supplying already-decoded instructions directly to the processor. Nikolaos Bellas et al. [12] propose a scheme in which the compiler generates code annotations to reduce the likelihood of a miss in the loop buffer cache. The drawback of this work is the trade-off, created by the selection of basic blocks, between performance degradation and power savings.…”
Section: Related Work
“…The architectural model representing this approach is the central loop buffer architecture for a single-processor organization. References [5], [6], [7], [8], [9], [10], [11] and [12] are examples of the work done in this class of architectures.…”
Section: Related Work
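The central loop buffer architecture mentioned above can be illustrated with a minimal sketch: a tiny instruction store captures a short loop body, and subsequent fetches that hit in it avoid touching the larger (and more power-hungry) instruction cache. The class name, slot sizing, and `{address: instruction}` representation here are illustrative assumptions, not the design of any cited paper.

```python
# Minimal, illustrative model of a central loop buffer: a small
# instruction store that services fetches for a captured loop so the
# main instruction cache can stay idle on repeat iterations.

class LoopBuffer:
    def __init__(self, size):
        self.size = size    # number of instruction slots available
        self.slots = {}     # address -> instruction for the captured loop

    def capture(self, loop_body):
        """Fill the buffer from an {addr: instr} loop body if it fits."""
        if len(loop_body) <= self.size:
            self.slots = dict(loop_body)
            return True
        return False

    def fetch(self, addr):
        """Return (instr, hit); hit=True means no main-cache access needed."""
        if addr in self.slots:
            return self.slots[addr], True
        return None, False


# Usage: a 4-instruction loop executed 3 times; every fetch after
# capture is served by the loop buffer instead of the main cache.
buf = LoopBuffer(size=8)
body = {100: "ld", 104: "add", 108: "st", 112: "bne 100"}
buf.capture(body)
hits = sum(buf.fetch(a)[1] for _ in range(3) for a in sorted(body))
print(hits)  # 12 fetches served by the loop buffer
```

In real designs the energy saving comes from gating the instruction-cache access path on a loop-buffer hit; this sketch only models the hit/miss decision.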
“…VLIW instruction scheduling has been studied [16][17][18][19], whereas others have considered dynamic voltage scaling techniques [20] and the use of compiler-controlled caches for frequently executed code [21]. For superscalar processors, most contributions have considered dynamic voltage scaling techniques [20,22].…”
Section: Related Work
“…Dynamic binary translation methods profile in order to store the translation results of frequent code regions, for improved performance as well as power [13], while dynamic optimization methods search for the hottest blocks for runtime recompilation [3]. The approaches in [4][9] use profiling to detect frequent loops and map them to a special address region that the architecture then maps to a small low-power loop cache, while the approach in [12] compresses those regions to reduce memory traffic and hence power. The approach in [6] profiles values of variables or subroutine parameters to detect pseudo-constants that can aid a compiler in optimizing for performance, or even for reduced energy [7].…”
Section: Introduction
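The profile-driven loop detection described in this excerpt can be sketched as follows. The heuristic of treating a backward branch (target address below the branch address) as a loop marker, the trace format, and the threshold are all illustrative assumptions; the cited papers use their own profiling machinery.

```python
# Hedged sketch of profile-driven hot-loop selection: count executions
# of backward-branch targets in a branch trace, then report the targets
# frequent enough to be worth mapping into a low-power loop-cache region.

from collections import Counter

def hot_loop_targets(branch_trace, threshold):
    """branch_trace: list of (branch_pc, target_pc) pairs. A backward
    branch (target_pc < branch_pc) marks a loop; return the set of loop
    targets executed at least `threshold` times."""
    counts = Counter(t for b, t in branch_trace if t < b)
    return {t for t, n in counts.items() if n >= threshold}


# Usage: one hot loop (50 iterations), one cold loop (3 iterations),
# and a forward branch that is ignored entirely.
trace = [(112, 100)] * 50 + [(240, 200)] * 3 + [(300, 400)] * 10
print(hot_loop_targets(trace, threshold=10))  # {100}
```

A compiler or binary rewriter would then relocate the selected loop bodies into the special address region that the hardware steers to the loop cache.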