2018
DOI: 10.1016/j.parco.2018.06.001

Evaluating the SW26010 many-core processor with a micro-benchmark suite for performance optimizations

Cited by 16 publications (8 citation statements)
References 4 publications
“…According to a Stream benchmark test, when being accessed by Gload/Gstore instructions, the Copy, Scale, Add, and Triad maximum bandwidths are only 3.88 GB/s, 1.61 GB/s, 1.45 GB/s, and 1.48 GB/s, respectively. Correspondingly, when using DMA PE mode, the maximum Copy bandwidth reaches 27.9 GB/s, the maximum Scale bandwidth is 24.1 GB/s, the Add bandwidth is 23.4 GB/s, and the Triad bandwidth is 22.6 GB/s [22]. According to the above data, the DMA prefers transferring massive data from the main memory to the SPM of the CPE, and Gload/Gstore prefers transferring small and random data between the main memory and the SPM.…”
Section: SW26010 Processor Architecture and Analysis (mentioning)
confidence: 96%
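To make the gap between the two access paths concrete, the short sketch below divides the DMA bandwidths by the Gload/Gstore bandwidths quoted in the statement above. The figures are copied from that statement; the variable names are illustrative only.

```python
# Peak STREAM bandwidths in GB/s, as quoted in the citation statement above.
gload_gstore = {"Copy": 3.88, "Scale": 1.61, "Add": 1.45, "Triad": 1.48}
dma_pe_mode = {"Copy": 27.9, "Scale": 24.1, "Add": 23.4, "Triad": 22.6}

# Ratio of DMA bandwidth to Gload/Gstore bandwidth for each STREAM kernel.
speedup = {kernel: dma_pe_mode[kernel] / gload_gstore[kernel]
           for kernel in gload_gstore}

for kernel, ratio in speedup.items():
    print(f"{kernel:5s}: DMA is {ratio:.1f}x faster than Gload/Gstore")
```

On these numbers DMA is roughly 7 times faster for Copy and about 15 to 16 times faster for Scale, Add, and Triad, which is consistent with the statement's conclusion that DMA suits bulk transfers.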
“…The absolute latency of the unaligned access was even higher. In addition, the latency of the vectorization operation, which includes the arithmetic and permutation operations [22], is listed in Table 4. The instruction prefix 'v' stands for the vector operations.…”
Section: Vectorization (mentioning)
confidence: 99%
“…Each CPE has an in-order dual-issue pipeline (pipeline 0 or pipeline 1) that allows the 4-wide SIMD floating point instructions to co-issue with the data motion instructions in the same cycles [28]. It can execute two instructions per cycle, one on pipeline 0 and the other on pipeline 1.…”
Section: Instruction Pipelines (mentioning)
confidence: 99%
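The co-issue behaviour described in the statement above can be illustrated with a toy model. This is a minimal sketch under strong assumptions that are not taken from the cited papers: every instruction has one-cycle latency, there are no data dependencies, and two adjacent instructions co-issue only when one is a floating-point instruction and the other a data-motion instruction. The function name `count_issue_cycles` and the labels `"fp"` and `"mem"` are hypothetical.

```python
def count_issue_cycles(stream):
    """Count issue cycles for an in-order dual-issue pipeline (toy model).

    Assumption: two adjacent instructions issue in the same cycle only if
    they are of different kinds ("fp" vs "mem"), so that each can go to
    its own pipeline. Latencies and dependencies are ignored.
    """
    cycles = 0
    i = 0
    while i < len(stream):
        can_pair = i + 1 < len(stream) and stream[i] != stream[i + 1]
        i += 2 if can_pair else 1  # in-order: never skip ahead to find a partner
        cycles += 1
    return cycles


interleaved = ["fp", "mem"] * 4   # compute and data motion alternate
compute_only = ["fp"] * 8         # nothing to pair with

print(count_issue_cycles(interleaved))   # 4 cycles for 8 instructions
print(count_issue_cycles(compute_only))  # 8 cycles for 8 instructions
```

In this model an interleaved stream finishes in half the cycles of a compute-only stream of the same length, which is why instruction ordering matters on an in-order dual-issue core.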
“…The SW26010 processor [8], [12] is a heterogeneous manycore architecture that uses distributed shared storage and on-chip computing array. As illustrated on the left side of Fig.…”
Section: B SW26010 Processor Architecture (mentioning)
confidence: 99%
“…The system can achieve 74% of the theoretical performance (93 PFlops) when running LINPACK [9]. As the main contributor to the computational power of the Sunway TaihuLight, SW26010 has several special architectural features [10]-[12], such as an 8 × 8 CPE (computing processing element) cluster, software-controlled memory hierarchy, hardware-supported register communication, and CPE double-pipeline instruction execution, all of which have great potential for implementing matrix multiplication.…”
Section: Introduction (mentioning)
confidence: 99%
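The efficiency figure in the statement above can be turned into an implied theoretical peak with one division. The sketch assumes that 93 PFlops is the sustained LINPACK result and that 74% is the ratio of sustained to peak performance; the statement's wording leaves this slightly ambiguous, and the variable names are illustrative.

```python
# Figures quoted in the citation statement above.
sustained_pflops = 93.0   # assumed to be the sustained LINPACK result
efficiency = 0.74         # assumed to be sustained / theoretical peak

# Implied theoretical peak under those assumptions.
implied_peak_pflops = sustained_pflops / efficiency

print(f"Implied theoretical peak: {implied_peak_pflops:.1f} PFlops")
```

Under this reading the implied peak is a little under 126 PFlops.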