Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary Translation

Liu, Yuping; Hong, Ding-Yong; Wu, Jan‐Jan; Fu, Sheng-Yu; Hsu, Wei-Chung

doi:10.1145/3301488

Cited by 5 publications

(6 citation statements)

References 39 publications


“…This dynamic technique enables short-SIMD binaries portability across newer, wider SIMD generations [151]. Spill-aware superword level parallelism (saSLP) [152] exploits the x86 AVX2 host's parallelism, gathers instructions, and registers capacity. To support that, it combines short ARMv8 instructions and registers in the guest binaries.…”

Section: Instruction-level Approach (mentioning)

confidence: 99%

MIMD Programs Execution Support on SIMD Machines: A Holistic Survey

Mustafa, Alkhasawneh, Obeidat et al. 2024

IEEE Access


The Single Instruction Multiple Data (SIMD) architecture, supported by various high-performance computing platforms, efficiently utilizes data-level parallelism. The SIMD model is used in traditional CPUs, dedicated vector systems, and accelerators such as GPUs, vector extensions, and Xeon Phi. It provides performance throughput in computation-intensive and data-parallel applications. Despite the similarity of data-processing principles between these architectures, porting various programming models between the reviewed platforms is challenging. Furthermore, enhancing the programmability of these architectures is an important feature for utilizing their emerging computing power and simplifying programming complexity. This paper reviews the basic principles of optimization techniques to run asynchronous Multiple Instruction Multiple Data (MIMD) on SIMD accelerators. It also surveys several GPU programming paradigms and application programming interfaces (APIs) and classifies these frameworks into different groups based on their criteria. In addition, a review of studies that performed a comparison of the collaborative execution of GPUs with CPUs and Xeon Phi is presented in this paper. This study will be beneficial for developers and researchers in the field of computer architecture and parallel computing of intensive scientific applications, specifically for early-stage high-performance computing researchers, to obtain a brief overview of performance optimization opportunities as well as the challenges of existing SIMD platforms.


“…Vectorization methods for loops include SLP-oriented loop unrolling optimization [33][34][35], selection optimization of vector methods based on program parallelism features [36], and vector recognition optimization based on directed graph reachability [37]. In addition, SLP methods have also been applied in the fields of dynamic code conversion [38] and optimization of vector code in inline assembly form [39], etc.…”

Section: Wireless Communications and Mobile Computing (mentioning)

confidence: 99%

An SLP Vectorization Method Based on Equivalent Extended Transformation

Feng, Tao et al. 2022

Wireless Communications and Mobile Computing


SIMD extensions provide an energy-efficient platform for mobile systems. How to use SIMD instructions to improve program performance remains a challenge. SLP (superword level parallelism) is an efficient way to exploit SIMD-oriented parallelism between statements in a basic block, and it is used in almost all mainstream compilers. SLP relies on finding isomorphic statements to pack together into vectors. However, its autovectorization capability for nonisomorphic statements is insufficient. In this paper, we introduce SLP-E, a novel autovectorization method that automatically vectorizes code containing nonisomorphic statements: it translates the nonisomorphic statements into isomorphic statements by equivalent extended transformation of expressions and then vectorizes the isomorphic statements. SLP-E improves the application scope and benefits of SLP. We implement SLP-E in LLVM and compare it with prior approaches on a set of applications from the SPEC CPU 2017 benchmark that benefit from autovectorization. Experimental results show that SLP-E achieves more than 43.9% speedup, on average, over other similar methods.


“…Figure 4 shows the workflow for binary lifting, including the three different IRs used throughout this process, namely, MCInst, MachineInstr, and finally the LLVM IR. By lifting the binary to the LLVM IR, we are able to re-optimize the program, enabling us to exploit features that are specific of the target ISA [40,46] or focus on a different objective function such as code-size reduction [20, 57–60]. First, the source binary is disassembled to an array of MCInst, which is the lowest-level IR in LLVM, working as an in-memory representation of the disassembled binary code.…”

Section: Binary Lifting (mentioning)

confidence: 99%

Lasagne: a static binary translator for weak memory model architectures

Rocha, Sprokholt, Fink et al. 2022

Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation


The emergence of new architectures creates a recurring challenge: ensuring that existing programs still work on them. Manually porting legacy code is often impractical. Static binary translation (SBT) is a process where a program's binary is automatically translated from one architecture to another while preserving its original semantics. However, these SBT tools have limited support for various advanced architectural features. Importantly, they are currently unable to translate concurrent binaries. The main challenge arises from the mismatches of the memory consistency model specified by the different architectures, especially when porting existing binaries to a weak memory model architecture. In this paper, we propose Lasagne, an end-to-end static binary translator with precise translation rules between x86 and Arm concurrency semantics. First, we propose a concurrency model for Lasagne's intermediate representation (IR) and formally proved mappings between the IR and the two architectures. The memory ordering is preserved by introducing fences in the translated code. Finally, we propose optimizations focused on raising the level of abstraction of memory address calculations and reducing the number of fences. Our evaluation shows that Lasagne reduces the number of fences by up to about 65%, with an average reduction of 45.5%, significantly reducing their runtime overhead.
