2021
DOI: 10.1109/tpds.2021.3138856

Accelerating Large Sparse Neural Network Inference using GPU Task Graph Parallelism

Cited by 5 publications (1 citation statement)
References 11 publications

“…At a particular VLSI timing analysis example, Heteroflow can reduce a baseline runtime from 99 minutes to 13 minutes (7.7× speed-up) on a machine of 40 CPU cores and 4 GPUs. Future work will focus on distributing our scheduler based on [46] and incorporating a broader range of workloads, including machine learning [47], [48] and engineering simulation [49], [50], [51].…”
Section: Discussion (mentioning)
confidence: 99%