2022
DOI: 10.1016/j.parco.2021.102871

Using long vector extensions for MPI reductions

Cited by 9 publications (3 citation statements)
References 16 publications
“…Quicksort and Bitonic sort have also been optimized for the AVX-512 ISA [12], while stencil computations have been optimized for ARM SVE by Armejach et al [7]. Zhong et al [47] investigate the use of vector intrinsics in the context of MPI reductions, for both the AVX-512 and SVE ISAs.…”
Section: Related Work
confidence: 99%
“…First, data transfer between the host and the device for small copies in heterogeneous nodes is mainly performed by direct assignment. On x86 CPUs, the bandwidth of the point-to-point data copy can be increased, and the number of copies within the node reduced, by using the high-bandwidth copy instructions available through multi-vector access instruction support [14]; this reduces memory contention during large-volume communication, lowers the cache-miss rate, and improves the efficiency of the point-to-point communication functions in the heterogeneous runtime system. As shown in Figure 4, we propose to choose whether to use multi-vector access instructions to optimize copies between heterogeneous devices based on the heterogeneous node architecture.…”
Section: Multi-vector Copy Instruction Optimization
confidence: 99%
“…one single CPU instruction can simultaneously process from two doubles (four floats) up to eight doubles (or sixteen floats), depending on the underlying architecture [49]. Vectorization can be easily applied to a parallel framework [50], which in our case results in the scheme shown in Figure 2.…”
Section: Matrix-free and Matrix-based Solvers
confidence: 99%