Multi-role SpTRSV on Sunway Many-Core Architecture

Li, Mingzhen; Liu, Yi; Yang, Hailong; Luan, Zhongzhi; Qian, Depei

doi:10.1109/hpcc/smartcity/dss.2018.00109

Cited by 12 publications

(10 citation statements)

References 25 publications


“…For each CDP, it enumerates the sample-NMO velocity pairs (line 2), and then finds the intersection of the traveltime curve and traces. At each intersection, it first obtains the halfpoint of the current trace (line 9-11), then accesses the data with size of w (line 12-13), and finally retrieves the data computed in a window of width w (line 14-19). Each trace has its own corresponding halfpoints, therefore the accesses to halfpoints are continuous when walking through the traces sequentially.…”

Section: Improving Parallelism Within a CG (mentioning)

confidence: 99%

“…After initialization, the traces in a CDP are processed in sequence (line 9) and the data halfpoints is prefetched before a new trace is processed (line 10-12). For the current trace, the memory addresses of the data accesses are calculated for each sample-NMO velocity pair and kept in the k1 array (line 13-15). Then, the maximum and minimum memory address in k1 array is identified (line 16-18) and used to determine the memory range (len th) of data accesses (line 19).…”

Section: 32 (mentioning)

confidence: 99%
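
The address-range step described in the statement above (collect the addresses, take their minimum and maximum, fetch the covering range once) can be illustrated with a minimal sketch. This is not the citing paper's code: the function names, the `width` parameter, and the use of a Python list slice in place of a bulk memory transfer are all assumptions made for illustration.

```python
def access_window(addresses, width=1):
    """Return (start, length) of the smallest contiguous range that covers
    every access. `width` (hypothetical) is the number of elements read
    at each address."""
    if not addresses:
        raise ValueError("no addresses to cover")
    lo = min(addresses)
    hi = max(addresses)
    return lo, hi - lo + width


def gather_with_one_fetch(memory, addresses, width=1):
    """Fetch the covering range once, then serve each access from the
    local buffer. The slice stands in for a single bulk transfer."""
    start, length = access_window(addresses, width)
    buffer = memory[start:start + length]
    return [buffer[a - start:a - start + width] for a in addresses]


# Hypothetical data: three scattered accesses served by one range fetch.
memory = list(range(100, 200))
k1 = [7, 3, 5]
window = access_window(k1)
values = gather_with_one_fetch(memory, k1)
```

The trade-off in this sketch is that one larger transfer replaces several small ones, which only pays off when the addresses are close together.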

Section: Re-designing the Computation For (mentioning)

confidence: 99%

“…For instance, Liu et al [16] implement the efficient Sparse Matrix-Vector Multiplication (SpMV) on Sunway, which uses register communication to implement a complex communication mechanism, and thus achieves efficient mapping of SpMV algorithm to the hardware resources. Li et al [14] implement an efficient multi-role based SpTRSV algorithm on Sunway. It leverages the unique register communication mechanism to address memory bandwidth limitations.…”

Section: Performance Optimization On Sunway (mentioning)

confidence: 99%
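
For readers unfamiliar with SpTRSV (sparse triangular solve), the underlying operation is solving L x = b for a sparse lower-triangular L. The sketch below is a plain sequential forward substitution over a CSR matrix, shown only as a baseline; it is not the multi-role algorithm of the cited paper and uses no Sunway features. The function name and the example matrix are assumptions.

```python
def sptrsv_lower_csr(indptr, indices, data, b):
    """Solve L x = b by forward substitution, with L lower triangular
    and stored in CSR form (indptr, indices, data)."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j == i:
                diag = data[k]            # diagonal entry of row i
            elif j < i:
                acc -= data[k] * x[j]     # x[j] is already solved
            else:
                raise ValueError("matrix is not lower triangular")
        if diag is None or diag == 0:
            raise ValueError("zero or missing diagonal in row %d" % i)
        x[i] = acc / diag
    return x


# Hypothetical 3x3 example: L = [[2, 0, 0], [1, 1, 0], [0, 3, 4]]
indptr = [0, 1, 3, 5]
indices = [0, 0, 1, 1, 2]
data = [2.0, 1.0, 1.0, 3.0, 4.0]
x = sptrsv_lower_csr(indptr, indices, data, [2.0, 3.0, 10.0])
```

Row i depends on every earlier row j that has a nonzero in column j of row i, which is the data dependency that makes parallelizing SpTRSV harder than SpMV.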


Massively Scaling Seismic Processing on Sunway TaihuLight Supercomputer

Yang, Luan, et al. 2020

IEEE Trans. Parallel Distrib. Syst.

Self Cite


Common Midpoint (CMP) and Common Reflection Surface (CRS) are widely used methods for improving the signal-to-noise ratio in the field of seismic processing. These methods are computationally intensive and require high performance computing. This paper optimizes these methods on the Sunway many-core architecture and implements large-scale seismic processing on the Sunway TaihuLight supercomputer. We propose the following three optimization techniques: 1) we propose a software cache method to reduce the overhead of memory accesses, and share data among CPEs via the register communication; 2) we re-design the semblance calculation procedure to further reduce the overhead of memory accesses; 3) we propose a vectorization method to improve the performance when processing the small volume of data within short loops. The experimental results show that our implementations of CMP and CRS methods on Sunway achieve 3.50× and 3.01× speedup on average compared to the-state-of-the-art implementations on CPU. In addition, our implementation is capable to run on more than one million cores of Sunway TaihuLight with good scalability.


“…The second challenge is to optimize the generated code regarding the unique architecture features of Sunway. Observed by existing research works [15][16][17], the key to achieve high performance on Sunway is to 1) fully utilize the computing resources of CPEs for massive parallelism, and 2) leverage the LDM of each CPE to alleviate the bottleneck of memory access. Therefore, when the neural network compiler optimizes the generated code, the following three rules need to be complied: 1) use the DMA as much as possible when accessing main memory.…”

Section: Challenges For DL Compilation On Sunway (mentioning)

confidence: 99%

swTVM: Exploring the Automated Compilation for Deep Learning on Sunway Architecture

Li, Liu, Liao, et al. 2019

Preprint

Self Cite


The flourish of deep learning frameworks and hardware platforms has been demanding an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. In the meanwhile, the Sunway many-core processor renders itself as a competitive candidate for its attractive computational power in both scientific and deep learning applications. This paper combines the trends in these two directions. Specifically, we propose swTVM that extends the original TVM to support ahead-of-time compilation for architecture requiring cross-compilation such as Sunway. In addition, we leverage the architecture features during the compilation such as core group for massive parallelism, DMA for high bandwidth memory transfer and local device memory for data locality, in order to generate efficient code for deep learning application on Sunway. The experimental results show the ability of swTVM to automatically generate code for various deep neural network models on Sunway. The performance of automatically generated code for AlexNet and VGG-19 by swTVM achieves 6.71× and 2.45× speedup on average than hand-optimized OpenACC implementations on convolution and fully connected layers respectively. This work is the first attempt from the compiler perspective to bridge the gap of deep learning and high performance architecture particularly with productivity and efficiency in mind. We would like to open source the implementation so that more people can embrace the power of deep learning compiler and Sunway many-core processor.


swGBDT: Efficient Gradient Boosted Decision Tree on Sunway Many-Core Processor

Yin, Dun, et al. 2020

Supercomputing Frontiers
