SIMD defragmenter

Park, Yongjun; Park, Hyunchul; Cho, Hyoun Kyu; Mahlke, Scott

doi:10.1145/2150976.2151014

Cited by 22 publications

(5 citation statements)

References 26 publications


“…SIMD extensions have been widely used in desktop for multimedia applications [1]. SIMD extensions offer high performance, high power consumption, and portability and are also suitable for mobile systems [2][3][4].…”

Section: Introduction (mentioning)

confidence: 99%

“…VeGen [21] implemented a compilation framework that uses non-SIMD instructions to realize automatic vectorization of nonisomorphic statements. Methods based on hardware special instructions are generally limited by the processor platform and introduce additional operating costs. (2) The nonisomorphic statement vectorization method based on expression equivalence transformation mainly uses expression equivalence transformation to convert nonisomorphic statements that satisfy certain conditions into isomorphic statements, thereby creating conditions for the implementation of SLP. For example, the LSLP method [19] analyzes and processes multiple nonisomorphic statements with differences in the order of operations and rearranges the commutative operations and operands based on the commutative law when the conditions are suitable to obtain isomorphic statements.…”

Section: Introduction (mentioning)

confidence: 99%
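
The operand reordering described in the statement above can be illustrated with a small sketch. This is not the LSLP algorithm itself (which the statement only summarizes); it is a simplified model, and the tuple encoding of expressions and the names `shape`, `reorder`, and `isomorphic` are assumptions made for illustration.

```python
# Expressions are either a variable name (str) or a tuple (op, lhs, rhs).
# Hypothetical encoding, chosen only for this illustration.
COMMUTATIVE = {"add", "mul"}

def shape(expr):
    """Return the operator tree of an expression with variable names erased."""
    if isinstance(expr, str):
        return "leaf"
    op, lhs, rhs = expr
    return (op, shape(lhs), shape(rhs))

def reorder(expr):
    """For commutative ops, move a compound operand ahead of a plain variable."""
    if isinstance(expr, str):
        return expr
    op, lhs, rhs = expr
    lhs, rhs = reorder(lhs), reorder(rhs)
    if op in COMMUTATIVE and isinstance(lhs, str) and not isinstance(rhs, str):
        lhs, rhs = rhs, lhs
    return (op, lhs, rhs)

def isomorphic(a, b):
    """Two statements can be packed together only if their shapes match."""
    return shape(a) == shape(b)

s1 = ("add", ("mul", "a", "b"), "c")   # a*b + c
s2 = ("add", "d", ("mul", "e", "f"))   # d + e*f

before = isomorphic(s1, s2)
after = isomorphic(reorder(s1), reorder(s2))
```

Here `s1` and `s2` differ only in operand order, so they fail the shape check as written but pass it once the commutative `add` is reordered. The sketch swaps operands only in the simplest case (one variable, one compound operand).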


An SLP Vectorization Method Based on Equivalent Extended Transformation

Feng, Tao, et al. 2022

Wireless Communications and Mobile Computing


SIMD extensions provide an efficient energy consumption platform to support mobile systems. How to use SIMD instructions to improve program performance is a challenge. SLP (superword level parallelism) is an efficient solution to exploit the parallelism, oriented to SIMD, between statements in the basic blocks, and it has been widely used in almost all the mainstream compilers. SLP relies on finding isomorphic statements to pack together into vectors. However, the capability of autovectorization for nonisomorphic statements is insufficient. In this paper, we introduce SLP-E, a novel autovectorization method that can automatically vectorize the codes which contain nonisomorphic statements, translate the nonisomorphic statements into the isomorphic statements by equivalent extended transformation of expressions, and vectorize the isomorphic statements. SLP-E improves the application scope and benefits of SLP. We implement the SLP-E in LLVM and compare it with prior approaches. A set of applications that benefit from autovectorization are taken from the SPEC CPU 2017 benchmark to compare our approach and prior techniques. Experimental results show that SLP-E achieves more than 43.9% speedup, on average, over other similar methods.
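
The abstract above does not spell out what an "equivalent extended transformation" looks like. One plausible reading, shown here purely as an assumption and not as the SLP-E implementation, is padding a shorter expression with an identity element so that it has the same operator as its neighbour. The names `extend`, `evaluate`, `IDENTITY`, and the tuple encoding are hypothetical.

```python
import operator

# Hypothetical encoding: a variable name (str), a constant, or (op, lhs, rhs).
OPS = {"add": operator.add, "mul": operator.mul}
IDENTITY = {"add": 0, "mul": 1}   # x + 0 == x, x * 1 == x

def extend(operand, op):
    """Rewrite a bare operand as `operand op identity`; the value is unchanged."""
    return (op, operand, IDENTITY[op])

def evaluate(expr, env):
    """Evaluate an expression tree against a dict of variable values."""
    if isinstance(expr, tuple):
        op, lhs, rhs = expr
        return OPS[op](evaluate(lhs, env), evaluate(rhs, env))
    if isinstance(expr, str):
        return env[expr]
    return expr  # numeric constant

stmt_x = ("add", "a", "b")   # x = a + b
stmt_y = "c"                 # y = c   (no operator: not isomorphic to stmt_x)
extended_y = extend(stmt_y, "add")   # y = c + 0

env = {"a": 1, "b": 2, "c": 5}
```

After the rewrite both statements are a single `add`, so an SLP-style packer could place them in the same vector operation, at the cost of one redundant lane computation.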


“…To accelerate applications efficiently using vector units, a compiler or programmer should find a substantial amount of underlying data parallelism and translate the parallelization potential into a real code to make sufficient use of the vector unit. Although many techniques for improving the quality of the vector code have been proposed [4][5][6][7], the resulting vector resource utilization is still low. Manual vector code optimization is a basic approach; however, it requires a deep understanding of the target vector architectures, and the optimized codes have limited reusability.…”

Section: Introduction (mentioning)

confidence: 99%

“…Automatic compiler-level vectorization is a promising alternative to manual vector code generation, but it cannot provide sufficient coverage because it can vectorize only 45-71% of loops, even in synthetic benchmarks [8]. Moreover, many vectorized applications do not show sufficient performance gains, as expected, owing to the high data alignment overhead [4,7]. Although many vectorization libraries also utilize vector units by providing more general interfaces, they are still limited in use.…”

Section: Introduction (mentioning)

confidence: 99%

A Collaborative CPU Vector Offloader: Putting Idle Vector Resources to Work on Commodity Processors

Son, Kang, et al. 2021

Electronics

Self Cite


Most modern processors contain a vector accelerator or internal vector units for the fast computation of large target workloads. However, accelerating applications using vector units is difficult because the underlying data parallelism should be uncovered explicitly using vector-specific instructions. Therefore, vector units are often underutilized or remain idle because of the challenges faced in vector code generation. To solve this underutilization problem of existing vector units, we propose the Vector Offloader for executing scalar programs, which considers the vector unit as a scalar operation unit. By using vector masking, an appropriate partition of the vector unit can be utilized to support scalar instructions. To efficiently utilize all execution units, including the vector unit, the Vector Offloader suggests running the target applications concurrently in both the central processing unit (CPU) and the decoupled vector units, by offloading some parts of the program to the vector unit. Furthermore, a profile-guided optimization technique is employed to determine the optimal offloading ratio for balancing the load between the CPU and the vector unit. We implemented the Vector Offloader on a RISC-V infrastructure with a Hwacha vector unit, and evaluated its performance using a Polybench benchmark set. Experimental results showed that the proposed technique achieved performance improvements up to 1.31× better than the simple, CPU-only execution on a field programmable gate array (FPGA)-level evaluation.
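
The idea of using vector masking so that only a partition of the vector unit executes scalar work can be modelled in a few lines. This is a pure-Python model under stated assumptions, not Hwacha or RISC-V code; the 4-lane width and the names `masked_add`, `LANES`, and `mask` are illustrative.

```python
def masked_add(dst, a, b, mask):
    """Lane-wise add; lanes whose mask bit is 0 keep their old dst value."""
    return [x + y if m else d for d, x, y, m in zip(dst, a, b, mask)]

LANES = 4  # assumed vector width for this sketch

# Two independent scalar adds (1 + 10 and 2 + 20) are placed in lanes 0 and 1.
a = [1, 2, 0, 0]
b = [10, 20, 0, 0]
dst = [-1] * LANES          # previous register contents
mask = [1, 1, 0, 0]         # only the first two lanes are active

result = masked_add(dst, a, b, mask)
```

Lanes 2 and 3 are left untouched, which is what lets a scalar instruction stream borrow part of the vector unit without disturbing the rest of the register.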


“…However, vectorization is often much less effective for applications which have low trip count loops, complex control flow, and non-uniform execution behavior [7]. As a result, SIMD lanes remain idle due to insufficient DLP [8]. SIMD widths have been following an upward trend: the 128-bit Streaming SIMD Extensions (SSE) of x86 architectures has been augmented by 256-bit Advanced Vector Extensions (AVX); the new Intel Many Integrated Core (MIC) architecture supports 512-bit SIMD.…”

Section: Introduction (mentioning)

confidence: 99%

Insufficient Vectorization: A New Method to Exploit Superword Level Parallelism

Gao, Lin, Zhao, et al. 2017

IEICE Trans. Inf. & Syst.


Summary: Single-instruction multiple-data (SIMD) extension provides an energy-efficient platform to scale the performance of media and scientific applications while still retaining post-programmability. However, the major challenge is to translate the parallel resources of the SIMD hardware into real application performance. Currently, all the slots in the vector register are used when compilers exploit SIMD parallelism of programs, which can be called sufficient vectorization. Sufficient vectorization means all the data in the vector register is valid. Because all the slots which vector register provides must be used, the chances of vectorizing programs with low SIMD parallelism are abandoned by sufficient vectorization method. In addition, the speedup obtained by full use of vector register sometimes is not as great as that obtained by partial use. Specifically, the length of vector register provided by SIMD extension becomes longer, sufficient vectorization method cannot exploit the SIMD parallelism of programs completely. Therefore, insufficient vectorization method is proposed, which refer to partial use of vector register. First, the adaptation scene of insufficient vectorization is analyzed. Second, the methods of computing inter-iteration and intra-iteration SIMD parallelism for loops are put forward. Furthermore, according to the relationship between the parallelism and vector factor a method is established to make the choice of vectorization method, in order to vectorize programs as well as possible. Finally, code generation strategy for insufficient vectorization is presented. Benchmark test results show that insufficient vectorization method vectorized more programs than sufficient vectorization method by 107.5% and the performance achieved by insufficient vectorization method is 12.1% higher than that achieved by sufficient vectorization method.

Key words: SIMD extension, SIMD parallelism, vector register, insufficient vectorization
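
The contrast between sufficient and insufficient vectorization in the abstract above can be sketched as a lane-count decision. This is a deliberately simplified model: the paper's actual selection method relates parallelism to the vector factor in ways the abstract does not detail, and the function name `lanes_used` and its parameters are assumptions.

```python
def lanes_used(parallelism, vector_lanes, allow_partial):
    """Return how many lanes a vectorizer would fill, or 0 if it gives up.

    parallelism   -- number of independent operations available (assumed known)
    vector_lanes  -- number of slots in the vector register
    allow_partial -- True models insufficient vectorization (partial register use)
    """
    if parallelism >= vector_lanes:
        return vector_lanes          # every slot holds valid data
    if allow_partial and parallelism > 1:
        return parallelism           # fill only some slots, leave the rest idle
    return 0                         # sufficient vectorization abandons the code

# Three independent operations on a 4-lane register.
sufficient = lanes_used(3, 4, allow_partial=False)
insufficient = lanes_used(3, 4, allow_partial=True)
```

With three independent operations and four lanes, the all-or-nothing policy leaves the code scalar, while the partial policy uses three of the four slots. Whether partial use actually pays off depends on costs this sketch does not model.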
