Evaluating the SW26010 many-core processor with a micro-benchmark suite for performance optimizations

Lin, James; Xu, Zhigeng; Cai, Linjin; Nukada, Akira; Matsuoka, Satoshi

doi:10.1016/j.parco.2018.06.001

Cited by 16 publications

(8 citation statements)

References 4 publications


“…According to a Stream benchmark test, when being accessed by Gload/Gstore instructions, the Copy, Scale, Add, and Triad maximum bandwidths are only 3.88 GB/s, 1.61 GB/s, 1.45 GB/s, and 1.48 GB/s, respectively. Correspondingly, when using DMA PE mode, the maximum Copy bandwidth reaches 27.9 GB/s, the maximum Scale bandwidth is 24.1 GB/s, the Add bandwidth is 23.4 GB/s, and the Triad bandwidth is 22.6 GB/s [22]. According to the above data, the DMA prefers transferring massive data from the main memory to the SPM of the CPE, and Gload/Gstore prefers transferring small and random data between the main memory and the SPM.…”

Section: SW26010 Processor Architecture and Analysis (mentioning)

confidence: 96%
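To put the quoted figures side by side, here is a minimal sketch in Python. The two dictionaries hold only the bandwidths stated in the citation statement above. The per-kernel array counts follow the usual STREAM accounting and are an assumption, not something the statement spells out. The function names are hypothetical.

```python
# Bandwidths quoted in the citation statement above, in GB/s.
GLOAD_GSTORE = {"Copy": 3.88, "Scale": 1.61, "Add": 1.45, "Triad": 1.48}
DMA_PE_MODE = {"Copy": 27.9, "Scale": 24.1, "Add": 23.4, "Triad": 22.6}

# Arrays touched per element by each kernel (assumed, usual STREAM accounting):
#   Copy:  a[i] = b[i]             -> 1 read + 1 write
#   Scale: a[i] = q * b[i]         -> 1 read + 1 write
#   Add:   a[i] = b[i] + c[i]      -> 2 reads + 1 write
#   Triad: a[i] = b[i] + q * c[i]  -> 2 reads + 1 write
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_bandwidth_gbs(kernel, n_elements, seconds, bytes_per_element=8):
    """Bandwidth in GB/s for one timed kernel run (hypothetical helper)."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * n_elements * bytes_per_element
    return bytes_moved / seconds / 1e9


def dma_advantage():
    """Ratio of DMA PE-mode bandwidth to Gload/Gstore bandwidth per kernel."""
    return {k: DMA_PE_MODE[k] / GLOAD_GSTORE[k] for k in GLOAD_GSTORE}


if __name__ == "__main__":
    for kernel, ratio in dma_advantage().items():
        print(f"{kernel}: DMA is about {ratio:.1f}x the Gload/Gstore bandwidth")
```

With the quoted numbers the ratio is roughly 7x for Copy and about 15x to 16x for the other three kernels, which fits the statement's conclusion that DMA suits bulk transfers.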


swHPFM: Refactoring and Optimizing the Structured Grid Fluid Mechanical Algorithm on the Sunway TaihuLight Supercomputer

Zhang, Zhou, et al. 2019. Applied Sciences


Fluid mechanical simulation is a typical high-performance computing problem. Due to the development of high-precision parallel algorithms, traditional computing platforms are unable to satisfy the computing requirements of large-scale algorithms. The Sunway TaihuLight supercomputer, which uses the SW26010 processor as its computing node, provides a powerful computing performance for this purpose. In this paper, the Sunway hierarchical parallel fluid machinery (swHPFM) framework and algorithm are proposed. Using the proposed framework and algorithm, engineers can exploit the parallelism of the existing fluid mechanical algorithm and achieve a satisfactory performance on the Sunway TaihuLight. In the framework, a suitable mapping of the model and the system architecture is developed, and the computing power of the SW26010 processor is fully utilized via the scratch pad memory (SPM) access strategy and serpentine register communication. In addition, the framework is implemented and tested by the axial compressor rotor simulation algorithm on a real-world dataset with Sunway many-core processors. The results demonstrate that we can achieve a speedup of up to 8.2×, compared to the original ported version, which only uses management processing elements (MPEs), as well as a 1.3× speedup compared to an Intel Xeon E5 processor. The proposed framework is useful for the optimization of fluid mechanical algorithm programs on computing platforms with a heterogeneous many-core architecture.



“…The absolute latency of the unaligned access was even higher. In addition, the latency of the vectorization operation, which includes the arithmetic and permutation operations [22], is listed in Table 4. The instruction prefix 'v' stands for the vector operations.…”

Section: Vectorization (mentioning)

confidence: 99%

swHPFM: Refactoring and Optimizing the Structured Grid Fluid Mechanical Algorithm on the Sunway TaihuLight Supercomputer

Zhang, Zhou, et al. 2019. Applied Sciences


“…Each CPE has an in-order dual-issue pipeline (pipeline 0 or pipeline 1) that allows the 4-wide SIMD floating point instructions to co-issue with the data motion instructions in the same cycles [28]. It can execute two instructions per cycle, one on pipeline 0 and the other on pipeline 1.…”

Section: Instruction Pipelines (mentioning)

confidence: 99%
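The co-issue behaviour described in this statement can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not a description of the real SW26010 issue logic: it treats each instruction as bound to exactly one pipeline (0 or 1), ignores data dependencies and latencies, and lets two instructions issue in the same cycle only when they are adjacent in program order and target different pipelines. The function name is hypothetical.

```python
def count_issue_cycles(stream):
    """Count issue cycles for an in-order dual-issue toy model.

    `stream` is a sequence of pipeline ids (0 or 1) in program order.
    Assumption: adjacent instructions co-issue only if they target
    different pipelines; dependencies and latencies are ignored.
    """
    cycles = 0
    i = 0
    while i < len(stream):
        # In-order issue: the next instruction always issues this cycle.
        if i + 1 < len(stream) and stream[i + 1] != stream[i]:
            i += 2  # the following instruction co-issues on the other pipeline
        else:
            i += 1  # same pipeline (or end of stream): single issue
        cycles += 1
    return cycles


if __name__ == "__main__":
    interleaved = [0, 1, 0, 1]  # e.g. SIMD arithmetic alternating with data motion
    grouped = [0, 0, 1, 1]      # same instructions, poorly ordered
    print(count_issue_cycles(interleaved), count_issue_cycles(grouped))
```

Under these assumptions the interleaved stream issues in 2 cycles and the grouped stream in 3, which shows why instruction reordering matters on an in-order dual-issue core.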

Towards Efficient Short-Range Pair Interaction on Sunway Many-Core Architecture

Chen, Han, et al. 2021. J. Comput. Sci. Technol.


The short-range pair interaction consumes most of the CPU time in molecular dynamics (MD) simulations. Its inherent computation sparsity makes it challenging to achieve a high-performance kernel on emerging many-core architectures. In this paper, we present a highly efficient short-range force kernel on the Sunway, a novel many-core architecture with many unique features. The parallel efficiency of this algorithm on the Sunway many-core processor is strongly limited by poor data locality and write conflicts. To enhance data locality, we adopt a super-cluster-based neighbor list with an appropriate granularity that fits in the local memory of the computing cores. In the absence of a low-overhead locking mechanism, a data-privatized force array is a more feasible way to avoid write conflicts, but it incurs a large data reduction overhead. We adopt a dual-slice partitioning scheme for both hardware resources and computing tasks, which uses on-chip data communication to reduce the data reduction overhead and provide load balancing. Moreover, we exploit single instruction multiple data (SIMD) parallelism and perform instruction reordering of the force kernel on this many-core processor. The experimental results show that the optimized force kernel obtains a performance speedup of 226x compared with the reference implementation and achieves 20% of the peak flop rate on the Sunway many-core processor.
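The data-privatization idea mentioned in this abstract can be sketched in a few lines. This is a sequential Python illustration of the general technique only, not the authors' kernel: the round-robin assignment of pairs to workers, the scalar force values, and the function name are all hypothetical.

```python
def privatized_forces(pairs, n_particles, n_workers):
    """Accumulate pair forces using one private force array per worker.

    `pairs` is a list of (i, j, f) tuples: f is added to particle i and
    subtracted from particle j. Workers are simulated sequentially here.
    """
    # Each worker owns a full private copy of the force array, so no two
    # workers ever write to the same location and no locking is needed.
    private = [[0.0] * n_particles for _ in range(n_workers)]
    for idx, (i, j, f) in enumerate(pairs):
        w = idx % n_workers  # hypothetical round-robin work assignment
        private[w][i] += f
        private[w][j] -= f
    # Reduction step: sum the private copies. Its cost grows with
    # n_workers * n_particles, which is the overhead the abstract mentions.
    return [sum(private[w][p] for w in range(n_workers))
            for p in range(n_particles)]


if __name__ == "__main__":
    pairs = [(0, 1, 1.0), (1, 2, 0.5), (0, 2, 0.25)]
    print(privatized_forces(pairs, n_particles=3, n_workers=2))
```

The result does not depend on the number of workers, but the reduction touches every private copy, so its cost rises as more workers are added.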


“…The SW26010 processor [8], [12] is a heterogeneous manycore architecture that uses distributed shared storage and on-chip computing array. As illustrated on the left side of Fig.…”

Section: B. SW26010 Processor Architecture (mentioning)

confidence: 99%

“…The system can achieve 74% of the theoretical performance (93 PFlops) when running LINPACK [9]. As the main contributor to the computational power of the Sunway TaihuLight, SW26010 has several special architectural features [10]-[12], such as an 8 × 8 CPE (computing processing element) cluster, software-controlled memory hierarchy, hardware-supported register communication, and CPE double-pipeline instruction execution, all of which have great potential for implementing matrix multiplication.…”

Section: Introduction (mentioning)

confidence: 99%

Runtime Adaptive Matrix Multiplication for the SW26010 Many-Core Processor

Chi et al. 2020. IEEE Access


The study of matrix multiplication on the emerging SW26010 processor is highly significant for many scientific and engineering applications. The state-of-the-art work from the swBLAS library, called SWMM, focuses mainly on the infrequent case involving special matrix dimensions and determines the execution action of matrix multiplication by one specified algorithm. To further adapt to various matrix shapes, in this article, we present a runtime adaptive matrix multiplication methodology, called RTAMM, which targets the features of the SW26010 architecture. The execution action of RTAMM is determined dynamically at runtime via several fundamental cost formulas and multiple sets of blocking factors, rather than determining the action at library generation time. With comprehensive trade-offs between the computation and data access, overall architecture-oriented optimization methods are introduced at three levels (macro, assistant, and micro) to fully exploit the computing capability of SW26010. The experiments show that RTAMM can achieve competitive peak performance compared with SWMM. Moreover, in tests on 6000 different matrix multiplication cases, RTAMM outperforms SWMM in 85.55% of the cases, and the improvements range from 5% to 308%, whereas RTAMM is slightly inferior to SWMM in only 1.28% of the cases. These results demonstrate that RTAMM has both great adaptability and considerable performance improvement.

