1. INTRODUCTION

Branch predictors are one of the key units in the front-end of modern high-performance microprocessors. They detect branches and predict the branch target address and the branch outcome in the early pipeline stages, thus reducing the number of clock cycles wasted due to control hazards. The target of a direct branch is predicted using a branch target buffer (BTB) [1], a cache structure indexed by a portion of the branch address. Each BTB entry typically includes the tag field, the offset field, the branch type field (e.g., direct/indirect, unconditional/conditional), the valid bit, the replacement bits for multi-way BTBs, and the target address. A separate hardware structure named an indirect branch target buffer (iBTB) can be employed to handle indirect branches with multiple target addresses [2][3][4]. The branch outcome predictors have evolved from a simple linear branch history table (BHT) with 2-bit saturating counters (2bc) [5] to the very sophisticated branch predictor structures found in recent commercial microprocessors [6][7][8][9]. A number of advanced predictor structures have been proposed, including (i) two-level adaptive predictors that exploit global or local histories of branch outcomes to achieve a better mapping into the BHT [10,11], (ii) de-interference predictors that reduce the negative effects of branch interference [12][13][14][15], (iii) hybrid predictors that include multiple specialized structures [16][17][18], and (iv) perceptron predictors [19,20].

Code optimizations based on information about branch predictor structures can greatly increase overall program performance [21,22]. For example, if the compiler is aware of the BTB size and organization, it can prevent branch interference in critical portions of the code by re-aligning the branch instructions.
Next, if the compiler is aware of local and global branch history lengths, it can employ code duplication or loop unrolling transformations to alleviate mispredictions [21]. Jimenez introduced the Camino C compiler [22], which exploits knowledge about the internal structures of the branch predictor. It performs feedback-directed code placement to reduce the number of branch mispredictions in the NetBurst architecture. This optimization reduces the number of branch mispredictions in the SPEC CPU2K benchmarks by between 3.5% and 22%.

Unfortunately, microprocessor manufacturers rarely disclose full information about the branch predictor organization, thus hindering efforts aimed at better code optimization. This problem can be addressed by employing reverse engineering techniques aimed at branch predictor units. A prior reverse engineering flow focusing on the P6 and NetBurst architectures [21] has been successful in determining the size and organization of the BTB and the presence and lengths of global and local histories. However, this flow does not include any experiments for determining the organization of predictor structures indexed by program path information, nor their internal operation. In addition, it does not incl...
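To make the simplest scheme named in the introduction concrete, the following is a minimal sketch of a linear BHT with 2-bit saturating counters, indexed by the low-order bits of the branch address. It is an illustration under stated assumptions, not the organization of any particular processor: the table size, the initial counter value, and the names `BranchHistoryTable` and `count_mispredictions` are hypothetical choices made for this example.

```python
class BranchHistoryTable:
    """Linear BHT of 2-bit saturating counters (illustrative sketch only)."""

    def __init__(self, entries=1024, initial=1):
        # Assumption: the table size is a power of two, so masking the
        # branch address with (entries - 1) selects a table index.
        assert entries > 0 and entries & (entries - 1) == 0
        self.mask = entries - 1
        # Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
        self.counters = [initial] * entries

    def _index(self, branch_address):
        return branch_address & self.mask

    def predict(self, branch_address):
        """Return True if the branch is predicted taken."""
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        """Move the counter toward the actual outcome, saturating at 0 and 3."""
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(bht, trace):
    """Replay (address, taken) pairs and count wrong predictions."""
    misses = 0
    for address, taken in trace:
        if bht.predict(address) != taken:
            misses += 1
        bht.update(address, taken)
    return misses
```

Replaying a trace in which two branches with opposite behavior map to the same entry (for example, addresses that differ by exactly the table size) shows the kind of branch interference that re-aligning branch instructions is meant to avoid: the shared counter is pulled in both directions and neither branch is predicted well.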
This paper introduces a new unobtrusive and cost-effective method for the capture and compression of program execution traces in real-time, which is based on a double move-to-front transformation. We explore its effectiveness and describe a cost-effective hardware implementation. The proposed trace compressor requires only 0.12 bits per instruction of trace port bandwidth, at the cost of 25K gates.
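For readers unfamiliar with the building block named in this abstract, here is a minimal sketch of the classic, single move-to-front (MTF) transform. It is not the paper's double MTF trace compressor, whose details are not given in the text above; the function names `mtf_encode` and `mtf_decode` and the explicit `alphabet` argument are assumptions made for this example.

```python
def mtf_encode(symbols, alphabet):
    """Replace each symbol by its current position in a recency list."""
    table = list(alphabet)
    indices = []
    for symbol in symbols:
        position = table.index(symbol)
        indices.append(position)
        # Move the symbol just seen to the front of the list.
        table.insert(0, table.pop(position))
    return indices


def mtf_decode(indices, alphabet):
    """Invert mtf_encode by replaying the same list updates."""
    table = list(alphabet)
    symbols = []
    for position in indices:
        symbol = table.pop(position)
        symbols.append(symbol)
        table.insert(0, symbol)
    return symbols


if __name__ == "__main__":
    encoded = mtf_encode("aabbbc", "abc")
    print(encoded)  # [0, 0, 1, 0, 0, 2]
    print("".join(mtf_decode(encoded, "abc")))  # aabbbc
```

Recently seen symbols map to small indices, so input with strong locality (such as repeating branch targets in a loop) turns into a stream dominated by zeros and other small values, which a later encoding stage can represent in few bits.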
Abstract: Unobtrusive capturing of program execution traces in real-time is crucial for debugging many embedded systems. However, tracing even limited program segments is often cost-prohibitive, requiring wide trace ports and large on-chip trace buffers. This paper introduces a new cost-effective technique for capturing and compressing program execution traces on-the-fly. It relies on branch predictor-like structures in the trace module and corresponding software modules in the debugger to significantly reduce the number of events that need to be streamed out of the target system. Coupled with an effective variable encoding scheme that adapts to changing program patterns, our technique requires merely 0.029 bits per instruction of trace port bandwidth, providing a 34-fold improvement over the commercial state-of-the-art and a five-fold improvement over academic proposals, at the low cost of under 5,000 logic gates.