Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2022
DOI: 10.1145/3503221.3508417
BaGuaLu

Cited by 17 publications (4 citation statements)
References 22 publications
“…Since the throughput of FP16 on the SW26010pro is nearly four times that of FP32, it is crucial to use FP16 to implement SWattention. Following the mixed-precision training strategy in [12], we use FP16 for the GEMMs in SWattention and FP32 for the exp operator. However, the precision conversions between FP16 and FP32 introduce additional overhead, preventing FP16 from achieving the expected speedup.…”
Section: Tiling Strategy and Mixed-Precision Training (mentioning, confidence: 99%)
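To make the precision split concrete, here is a minimal numpy sketch of the pattern this quote describes: FP16 for the two GEMMs, FP32 only for the numerically sensitive exp/softmax step. The function name, shapes, and use of numpy are illustrative assumptions, not the authors' SWattention kernel; the FP16-to-FP32-and-back casts are exactly the conversion overhead the quote mentions.

import numpy as np

def mixed_precision_attention(Q, K, V):
    # Q, K, V: float16 arrays of shape (seq_len, d). Hypothetical sketch,
    # not the SWattention implementation.
    d = Q.shape[-1]
    # First GEMM in FP16 (real kernels typically accumulate in FP32).
    scores = (Q @ K.T) / np.float16(np.sqrt(d))
    # Upcast to FP32 for exp, which is numerically sensitive in FP16.
    s32 = scores.astype(np.float32)
    s32 -= s32.max(axis=-1, keepdims=True)   # stabilize exp
    p32 = np.exp(s32)
    p32 /= p32.sum(axis=-1, keepdims=True)
    # Downcast for the second GEMM; this round trip is the cited overhead.
    return p32.astype(np.float16) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)).astype(np.float16) for _ in range(3))
out = mixed_precision_attention(Q, K, V)
print(out.shape, out.dtype)   # (128, 64) float16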
“…15, the DMA bandwidth is tested by loading matrix O from DRAM to LDM and storing matrix O from LDM to DRAM. Compared with the very limited bandwidths of global load and store operations, which reach only 0.24 GB/s and 0.024 GB/s [12], the DMA load and store bandwidths are 211 GB/s and 122 GB/s, respectively. DMA is therefore well suited to transferring contiguous memory blocks between DRAM and LDM, and its bandwidth approaches the theoretical 307 GB/s of the SW26010pro processor.…”
Section: Memory Access (mentioning, confidence: 99%)
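The gap between bulk contiguous transfers and element-wise accesses can be illustrated with a hypothetical host-side micro-benchmark. This is ordinary numpy on a commodity machine, not Sunway LDM/DMA code; absolute numbers will differ from the figures above, and only the orders-of-magnitude contrast between the two access patterns carries over.

import time
import numpy as np

N = 1 << 24                        # 16M float32 elements (~64 MB)
src = np.ones(N, dtype=np.float32)
dst = np.empty_like(src)

t0 = time.perf_counter()
dst[:] = src                       # one bulk copy of a contiguous block (DMA-like)
bulk = N * 4 / (time.perf_counter() - t0) / 1e9

t0 = time.perf_counter()
for i in range(0, N, 4096):        # many small strided transfers (load/store-like)
    dst[i] = src[i]
scalar = (N // 4096) * 4 / (time.perf_counter() - t0) / 1e9

print(f"bulk copy: {bulk:.1f} GB/s, element-wise: {scalar:.4f} GB/s")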
“…The hybrid structure of PR-MoE [21] improved parameter efficiency by fixing one shared expert. BaGuaLu [15] re-distributed the data chunks evenly, at the cost of model accuracy. However, almost all of these high-level algorithms are agnostic to the complicated effects of the underlying hardware on training performance.…”
Section: Related Work (mentioning, confidence: 99%)
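Assuming "re-distributed the data chunks evenly" means capping every expert at equal capacity and reassigning overflow tokens, a minimal Python sketch of such balancing might look as follows. The function and variable names are hypothetical, not BaGuaLu's implementation; note that reassigning overflow tokens overrides the gate's choices, which is where the accuracy cost comes from.

import numpy as np

def redistribute_evenly(expert_ids, num_experts):
    # Rebalance a skewed token-to-expert assignment so every expert
    # processes the same number of tokens (assumes len divides evenly).
    n = len(expert_ids)
    capacity = n // num_experts
    counts = np.zeros(num_experts, dtype=int)
    balanced = expert_ids.copy()
    overflow = []
    for i, e in enumerate(expert_ids):
        if counts[e] < capacity:
            counts[e] += 1            # keep the gate's choice
        else:
            overflow.append(i)        # expert is full; reassign later
    for i in overflow:
        e = int(np.argmin(counts))    # send token to the least-loaded expert
        balanced[i] = e
        counts[e] += 1
    return balanced

gate = np.array([0, 0, 0, 0, 1, 1, 2, 3])        # skewed gating: expert 0 overloaded
print(redistribute_evenly(gate, num_experts=4))  # every expert ends with 2 tokens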