Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2022
DOI: 10.1145/3503221.3508417
BaGuaLu

Cited by 17 publications (4 citation statements)
References 22 publications
“…Since the throughput of FP16 on the SW26010pro is nearly four times that of FP32, it is crucial to use FP16 to implement SWattention. Following the mixed-precision training strategy in [12], we use FP16 for the GEMMs in SWattention and FP32 for the exp operator. However, the precision conversions between FP16 and FP32 introduce additional overhead, preventing FP16 from achieving the expected speedup.…”
Section: Tiling Strategy and Mixed-Precision Training (mentioning, confidence: 99%)
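To make the precision split concrete, here is a minimal numpy sketch of the pattern this quote describes: FP16 for the two GEMMs, FP32 only for the numerically sensitive exp/softmax step. The function name, shapes, and use of numpy are illustrative assumptions, not the authors' SWattention kernel; the FP16-to-FP32-and-back casts are exactly the conversion overhead the quote mentions.

import numpy as np

def mixed_precision_attention(Q, K, V):
    # Q, K, V: float16 arrays of shape (seq_len, d). Hypothetical sketch,
    # not the SWattention implementation.
    d = Q.shape[-1]
    # First GEMM in FP16 (real kernels typically accumulate in FP32).
    scores = (Q @ K.T) / np.float16(np.sqrt(d))
    # Upcast to FP32 for exp, which is numerically sensitive in FP16.
    s32 = scores.astype(np.float32)
    s32 -= s32.max(axis=-1, keepdims=True)   # stabilize exp
    p32 = np.exp(s32)
    p32 /= p32.sum(axis=-1, keepdims=True)
    # Downcast for the second GEMM; this round trip is the cited overhead.
    return p32.astype(np.float16) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)).astype(np.float16) for _ in range(3))
out = mixed_precision_attention(Q, K, V)
print(out.shape, out.dtype)   # (128, 64) float16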
“…15, the DMA bandwidth is tested by loading matrix O from DRAM to LDM and storing matrix O from LDM to DRAM. Compared with the very limited bandwidths of global load and store operations, which reach only 0.24 GB/s and 0.024 GB/s [12], the DMA load and store bandwidths are 211 GB/s and 122 GB/s, respectively. DMA is therefore well suited to transferring contiguous memory blocks between DRAM and LDM, and its bandwidth approaches the theoretical 307 GB/s of the SW26010pro processor.…”
Section: Memory Access (mentioning, confidence: 99%)
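The gap between bulk contiguous transfers and element-wise accesses can be illustrated with a hypothetical host-side micro-benchmark. This is ordinary numpy on a commodity machine, not Sunway LDM/DMA code; absolute numbers will differ from the figures above, and only the orders-of-magnitude contrast between the two access patterns carries over.

import time
import numpy as np

N = 1 << 24                        # 16M float32 elements (~64 MB)
src = np.ones(N, dtype=np.float32)
dst = np.empty_like(src)

t0 = time.perf_counter()
dst[:] = src                       # one bulk copy of a contiguous block (DMA-like)
bulk = N * 4 / (time.perf_counter() - t0) / 1e9

t0 = time.perf_counter()
for i in range(0, N, 4096):        # many small strided transfers (load/store-like)
    dst[i] = src[i]
scalar = (N // 4096) * 4 / (time.perf_counter() - t0) / 1e9

print(f"bulk copy: {bulk:.1f} GB/s, element-wise: {scalar:.4f} GB/s")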
“…The hybrid structure of PR-MoE [21] improved parameter efficiency by fixing one shared expert. BaGuaLu [15] re-distributed the data chunks evenly, at the cost of model accuracy. However, almost all of these high-level algorithms are agnostic to the complicated effects of the underlying hardware on training performance.…”
Section: Related Work (mentioning, confidence: 99%)
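Assuming "re-distributed the data chunks evenly" means capping every expert at equal capacity and reassigning overflow tokens, a minimal Python sketch of such balancing might look as follows. The function and variable names are hypothetical, not BaGuaLu's implementation; note that reassigning overflow tokens overrides the gate's choices, which is where the accuracy cost comes from.

import numpy as np

def redistribute_evenly(expert_ids, num_experts):
    # Rebalance a skewed token-to-expert assignment so every expert
    # processes the same number of tokens (assumes len divides evenly).
    n = len(expert_ids)
    capacity = n // num_experts
    counts = np.zeros(num_experts, dtype=int)
    balanced = expert_ids.copy()
    overflow = []
    for i, e in enumerate(expert_ids):
        if counts[e] < capacity:
            counts[e] += 1            # keep the gate's choice
        else:
            overflow.append(i)        # expert is full; reassign later
    for i in overflow:
        e = int(np.argmin(counts))    # send token to the least-loaded expert
        balanced[i] = e
        counts[e] += 1
    return balanced

gate = np.array([0, 0, 0, 0, 1, 1, 2, 3])        # skewed gating: expert 0 overloaded
print(redistribute_evenly(gate, num_experts=4))  # every expert ends with 2 tokens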