Efficient code generation in a region-based dynamic binary translator

SpinkTom,; WagstaffHarry,; FrankeBjörn,; TophamNigel,

doi:10.1145/2666357.2597810

Cited by 3 publications

(5 citation statements)

References 29 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…This reduction in simulation efficiency can be reduced by using a on-FPGA embedded or soft processor to run the software fall-back model rather than require the high-latency communication to the host computer. The 16-core simulation achieves simulation rates up to 62 MIPS, and even the theoretical 100 MIPS throughput of the BlueSPARC engine is significantly slower than modern JIT compiled software simulators (typically several hundred MIPS per core, with results up to 1323 MIPS per core achieved by the current state of the art [47]). …”

Section: Protoflexmentioning

confidence: 99%

High Speed Cycle-Approximate Simulation of Embedded Cache-Incoherent and Coherent Chip-Multiprocessors

Thompson

Gould

Topham

2018

Int J Parallel Prog

Self Cite

View full text Add to dashboard Cite

The increasing density of silicon processes, coupled with the development of ever more energy and space efficient embedded core designs, has led to multi-processor system-on-chip (MPSoC) designs becoming increasingly attractive for use in embedded systems. Unfortunately this increase in core count gives rise to an explosion in design space possibilities, especially when heterogeneous designs are considered. To address this problem, new techniques in simulation are required to increase the simulation performance of these systems, while maintaining the accuracy needed to make good design decisions, and to verify the performance characteristics for realtime systems. We present a new high-speed, near cycle-accurate simulator, addressing an important but neglected category of multicore systems: deeply-embedded cacheincoherent MPSoCs. We take advantage of the unique properties of these systems to relax synchronisation constraints and increase the parallelism of the simulation. In doing so we achieve performance not possible using previous simulation techniques, without compromising the accuracy of the results. Quantitative performance results are presented across a large range of simulated MPSoC designs, comprising 1-64 cores, on average we simulate at 5.7 MIPS, with simulation speeds reaching 377 MIPS in the best case. Comparing against FPGA implementations we demonstrate that the simulator manages this with an average timing error of only 2.1%. Applying some of B Christopher Thompson

show abstract

Section: Protoflexmentioning

confidence: 99%

High Speed Cycle-Approximate Simulation of Embedded Cache-Incoherent and Coherent Chip-Multiprocessors

Thompson

Gould

Topham

2018

Int J Parallel Prog

Self Cite

View full text Add to dashboard Cite

show abstract

“…This means that these benchmarks also measure the handling of self modifying code. Furthermore, when optimisations such as concurrent code generation [7] or region-based code generation [28] are applied, these benchmarks will help to measure the effectiveness of these techniques.…”

Section: B Benchmark Categoriesmentioning

confidence: 99%

“…Control flow can also be split into direct (where the branch target is known in advance, and is encoded as an absolute or relative path into the instruction) and indirect (where the branch target is read from memory or a register). A large amount of work has been done on optimising the various forms of control flow [28,20,15,17].…”

Section: B Benchmark Categoriesmentioning

confidence: 99%

See 1 more Smart Citation

SimBench: A portable benchmarking methodology for full-system simulators

Wagstaff

Bodin

Spink

et al. 2017

2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

Self Cite

View full text Add to dashboard Cite

Abstract-Full-system simulators are increasingly finding their way into the consumer space for the purposes of backwards compatibility and hardware emulation (e.g. for games consoles). For such compute-intensive applications simulation performance is paramount. In this paper we argue that existing benchmark suites such as SPEC CPU2006, originally designed for architecture and compiler performance evaluation, are not well suited for the identification of performance bottlenecks in full-system simulators. While their large, complex workloads provide an indication as to the performance of the simulator on 'real-world' workloads, this does not give any indication of why a particular simulator might run an application faster or slower than another. In this paper we present SimBench, an extensive suite of targeted micro-benchmarks designed to run bare-metal on a fullsystem simulator. SimBench exercises dynamic binary translation (DBT) performance, interrupt and exception handling, memory access performance, I/O and other performance-sensitive areas. SimBench is cross-platform benchmarking framework and can be retargeted to new architectures with minimal effort. For several simulators, including QEMU, Gem5 and SimIt-ARM, and targeting ARM and Intel x86 architectures, we demonstrate that SimBench is capable of accurately pinpointing and explaining real-world performance anomalies, which are largely obfuscated by existing application-oriented benchmarks.

show abstract

“…Guest basic blocks are terminated at page boundaries for memory protection purposes. Normal control flow out of a block is optimized utilising techniques from Spink et al [2014], which includes directly chaining to other basic blocks that are part of the same memory page to avoid costly returns to the main execution loop. If a translation does not exist, or the destination does not live on the same page, then control is returned to the main execution loop, which will then handle the situation accordingly.…”

Section: Cpu Virtualizationmentioning

confidence: 99%

Hardware-Accelerated Cross-Architecture Full-System Virtualization

Spink

Wagstaff

Franke

2016

ACM Trans. Archit. Code Optim.

Self Cite

View full text Add to dashboard Cite

Hardware virtualization solutions provide users with benefits ranging from application isolation through server consolidation to improved disaster recovery and faster server provisioning. While hardware assistance for virtualization is supported by all major processor architectures, including Intel, ARM, PowerPC, and MIPS, these extensions are targeted at virtualization of the same architecture, for example, an x86 guest on an x86 host system. Existing techniques for cross-architecture virtualization, for example, an ARM guest on an x86 host, still incur a substantial overhead for CPU, memory, and I/O virtualization due to the necessity for software emulation of these mismatched system components. In this article, we present a new hardwareaccelerated hypervisor called CAPTIVE, employing a range of novel techniques that exploit existing hardware virtualization extensions for improving the performance of full-system cross-platform virtualization. We illustrate how (1) guest memory management unit (MMU) events and operations can be mapped onto host memory virtualization extensions, eliminating the need for costly software MMU emulation, (2) a block-based dynamic binary translation engine inside the virtual machine can improve CPU virtualization performance, (3) memory-mapped guest I/O can be efficiently translated to fast I/O specific calls to emulated devices, and (4) the cost for asynchronous guest interrupts can be reduced. For an ARM-based Linux guest system running on an x86 host with Intel VT support, we demonstrate application performance levels, based on SPEC CPU2006 benchmarks, of up to 5.88× over state-of-the-art QEMU and 2.5× on average, achieving a guest dynamic instruction throughput of up to 1280 MIPS (million instructions per second) and 915.52 MIPS, on average.

show abstract

Efficient code generation in a region-based dynamic binary translator

Cited by 3 publications

References 29 publications

High Speed Cycle-Approximate Simulation of Embedded Cache-Incoherent and Coherent Chip-Multiprocessors

High Speed Cycle-Approximate Simulation of Embedded Cache-Incoherent and Coherent Chip-Multiprocessors

SimBench: A portable benchmarking methodology for full-system simulators

Hardware-Accelerated Cross-Architecture Full-System Virtualization

Contact Info

Product

Resources

About