A Framework for the Automatic Vectorization of Parallel Sort on x86-Based Processors

Hou, Kaixi; Wang, Hao; Feng, Wu-chun

doi:10.1109/tpds.2018.2789903

Cited by 20 publications

(15 citation statements)

References 30 publications

“…83 Use of intrinsics is error-prone and increases code-development time. 88 Also, it requires in-depth understanding of both the algorithm and SIMD intrinsics. Further, since arbitrary data-movement is not feasible, multiple data-reordering functions may need to be used to arrange the data in desired order for computations.…”

Section: Need Of Code-rewriting (mentioning)

confidence: 99%

“…Further, since arbitrary data-movement is not feasible, multiple data-reordering functions may need to be used to arrange the data in desired order for computations. 88 Also, since the ISA and vector width supported by different processors is different, use of intrinsics may lead to non-portable code. 88 Some works propose writing individual versions of all functions for both Phi and CPU.…”

Section: Need Of Code-rewriting (mentioning)

confidence: 99%

A survey on evaluating and optimizing performance of Intel Xeon Phi

Mittal

2020

Concurrency and Computation

Summary: Intel's Xeon Phi combines the parallel processing power of a many‐core accelerator with the programming ease of CPUs. In this paper, we present a survey of works that study the architecture of Phi and use it as an accelerator for a broad range of applications. We review performance optimization strategies as well as the factors that bottleneck the performance of Phi. We also review works that perform comparison or collaborative execution of Phi with CPUs and GPUs. This paper will be useful for researchers and developers in the area of computer‐architecture and high‐performance computing.

“…For improving the efficiency of the TCP process, Zhang et al used faster mutation testing to both optimize test case sequence and reduce the number of test cases. With recent improvements of GPU computation techniques, parallel acceleration has been used in TCP. In the application extension of TCP techniques, Wang et al used three objectives, total time, resource usage, and density, to guide the evolution direction in the testing of videoconferencing systems.…”

Section: Related Work (mentioning)

confidence: 99%

Concrete hyperheuristic framework for test case prioritization

Bian, Zheng, Guo, et al.

2018

J Software Evolu Process

Test case prioritization (TCP), which aims to find the optimal test case execution sequences for specific testing objects, has been widely used in regression testing. A wide variety of search methodologies and algorithms have been proposed to optimize test case execution sequences, namely, search‐based TCP. However, different algorithms perform differently and have different implementation costs and specific situations where an algorithm usually performs with high effectiveness and efficiency. When facing a new testing scenario, it is actually difficult to decide which algorithm is suitable. In this paper, to address the algorithm selection problem for different test scenarios, a more generally applicable algorithm based on a hyperheuristic strategy is proposed for search‐based TCP. This includes a range of multiobjective algorithms with a variety of crossover strategies and a learning agent strategy to evaluate and select the appropriate algorithm execution sequence dynamically for different scenarios. The concrete hyperheuristic framework for multiobjective TCP is presented with an algorithm's repository in the low level and the learning agent strategy in the higher level. Experiments show that the proposed learning agent strategy can accurately evaluate algorithms in multiobjective problems and select the appropriate algorithm in each iteration.

“…The performance gain is mainly from the irregularity of the row distribution of D C. The sorting kernel has received much attention due to the pervasive need to order data in a plethora of applications. It has been parallelized and optimized on x86-based architectures [9,22] and GPUs [26,33,38,38,40]. Several optimized sort implementations have been included in vendor supplied libraries, e.g., cuDPP [19], Thrust [20], ModernGPU [3], and CUB [37].…”

Section: Suffix Array Construction (mentioning)

confidence: 99%

Fast segmented sort on GPUs

Hou, Liu, Wang, et al.

2017

Proceedings of the International Conference on Supercomputing

Self Cite

Segmented sort, as a generalization of classical sort, orders a batch of independent segments in a whole array. Along with the wider adoption of manycore processors for HPC and big data applications, segmented sort plays an increasingly important role than sort. In this paper, we present an adaptive segmented sort mechanism on GPUs. Our mechanisms include two core techniques: (1) a differentiated method for different segment lengths to eliminate the irregularity caused by various workloads and thread divergence; and (2) a register-based sort method to support N-to-M data-thread binding and in-register data communication. We also implement a shared memory-based merge method to support non-uniform length chunk merge via multiple warps. Our segmented sort mechanism shows great improvements over the methods from CUB, CUSP and ModernGPU on NVIDIA K80-Kepler and TitanX-Pascal GPUs. Furthermore, we apply our mechanism on two applications, i.e., suffix array construction and sparse matrix-matrix multiplication, and obtain obvious gains over state-of-the-art implementations.
