HitGraph: High-throughput Graph Processing Framework on FPGA

Zhou, Shijie; Kannan, Rajgopal; Prasanna, Viktor K.; Seetharaman, Guna; Wu, Qing

doi:10.1109/tpds.2019.2910068

Cited by 81 publications

(66 citation statements)

References 33 publications


“…Bellman-Ford Algorithm Accelerators. HitGraph [58] and its earlier version [57] implement an edge-centric graph accelerator. Leveraging the larger sequential bandwidth, HitGraph writes the intermediate relaxation results to DRAM when generated and reads them back when needed.…”

Section: 32 (mentioning)

confidence: 99%
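The edge-centric Bellman-Ford scheme described in the statement above can be sketched roughly as follows. This is a minimal software illustration, not HitGraph's actual design: the function name `bellman_ford_edge_centric`, the `(src, dst, weight)` edge tuples, and the in-memory `updates` list (standing in for the intermediate relaxation results that the accelerator writes to DRAM and reads back) are all assumptions made for the example.

```python
import math


def bellman_ford_edge_centric(num_vertices, edges, source):
    """Edge-centric Bellman-Ford sketch.

    edges is a list of (src, dst, weight) tuples (hypothetical layout).
    Assumes the graph has no negative-weight cycles.
    """
    dist = [math.inf] * num_vertices
    dist[source] = 0.0

    for _ in range(num_vertices - 1):
        # Scatter phase: stream over every edge in order and emit a
        # candidate distance for its destination. The list stands in for
        # intermediate results buffered in external memory.
        updates = [
            (dst, dist[src] + weight)
            for src, dst, weight in edges
            if dist[src] != math.inf
        ]

        # Gather phase: read the buffered updates back and keep the minimum.
        changed = False
        for dst, candidate in updates:
            if candidate < dist[dst]:
                dist[dst] = candidate
                changed = True

        if not changed:
            break  # converged early, no distance improved in this pass

    return dist
```

Because every pass touches all edges in storage order, memory access is sequential, but the total work depends on how many passes are needed before no distance changes.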

“…Another challenge to efficiently implement priority queue-based SSSP algorithms is that priority-order graph traversal prohibits many reordering techniques used in many graph accelerators [5,16,17,28,49,58], which are vitally important to reducing external memory traffic and achieving high performance. As such, many graph accelerators [5,28,58] implement the Bellman-Ford algorithm [50] that does not require a priority queue at all. However, these accelerators work best for algorithms whose amount of work is insensitive to the traversal order (e.g., SpMV and PageRank).…”

Section: Introduction (mentioning)

confidence: 99%


Accelerating SSSP for Power-Law Graphs

Chi, Guo, Cong (2022)

Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays


The single-source shortest path (SSSP) problem is one of the most important and well-studied graph problems widely used in many application domains, such as road navigation, neural image reconstruction, and social network analysis. Although we have known various SSSP algorithms for decades, implementing one for large-scale power-law graphs efficiently is still highly challenging today, because ① a work-efficient SSSP algorithm requires priority-order traversal of graph data, ② the priority queue needs to be scalable both in throughput and capacity, and ③ priority-order traversal requires extensive random memory accesses on graph data.

In this paper, we present SPLAG to accelerate SSSP for power-law graphs on FPGAs. SPLAG uses a coarse-grained priority queue (CGPQ) to enable high-throughput priority-order graph traversal with a large frontier. To mitigate the high-volume random accesses, SPLAG employs a customized vertex cache (CVC) to reduce off-chip memory access and improve the throughput to read and update vertex data. Experimental results on various synthetic and real-world datasets show up to a 4.9× speedup over state-of-the-art SSSP accelerators, a 2.6× speedup over 32-thread CPU running at 4.4 GHz, and a 0.9× speedup over an A100 GPU that has 4.1× power budget and 3.4× HBM bandwidth. Such a high performance would place SPLAG in the 14th position of the Graph 500 benchmark for data intensive applications (the highest using a single FPGA) with only a 45 W power budget. SPLAG is written in high-level synthesis C++ and is fully parameterized, which means it can be easily ported to various different FPGAs with different configurations. SPLAG is open-source at https://github.com/UCLA-VAST/splag.

CCS CONCEPTS • Theory of computation → Shortest paths; • Computer systems organization → Reconfigurable computing; High-level language architectures.
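To make "priority-order traversal" concrete, here is a minimal software sketch using a textbook binary-heap Dijkstra. It is not SPLAG's coarse-grained priority queue or vertex cache, which the abstract names but does not specify; the function name `dijkstra`, the adjacency-dict layout, and the `relaxations` counter are assumptions introduced only for illustration.

```python
import heapq
import math


def dijkstra(adj, source):
    """Priority-order SSSP sketch.

    adj maps every vertex to a list of (neighbor, weight) pairs
    (hypothetical layout). Assumes non-negative edge weights.
    Returns the distance map and the number of edge relaxations performed.
    """
    dist = {v: math.inf for v in adj}
    dist[source] = 0
    heap = [(0, source)]
    relaxations = 0

    while heap:
        d, u = heapq.heappop(heap)  # always expand the closest vertex next
        if d > dist[u]:
            continue  # stale queue entry, a shorter path was already found
        for v, w in adj[u]:
            relaxations += 1
            candidate = d + w
            if candidate < dist[v]:
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))

    return dist, relaxations
```

Each vertex is expanded once at its final distance, so each edge is relaxed at most once. The price is that `adj[u]` and `dist[v]` are accessed in an order dictated by the queue rather than by storage layout, which is the random-access pattern the abstract identifies as a challenge.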


“…• A general-purpose and user-friendly SpMM accelerator. Domain specific architectures [21,22,27,45] have been designed for boosting computing performance and efficiency in many application domains such as deep learning [5, 11, 12, 23, 31, 35, 47, 64-69, 77, 87, 88], dense linear algebra [23,29,30,35,77], graph processing [4,7,17,25,26,39,48,56,70,89,91,92,95], genomic and bio analysis [8,8,9,13,14,33,38,51,76,81], and data sorting [10,52,60,63]. However, most accelerators are designed for one specific problem with fixed input and output size.…”

Section: Motivation (mentioning)

confidence: 99%

“…• Challenge 3 -How to design a general-purpose accelerator which does not need to be rerun the time-consuming flow of synthesis/place/route. While many accelerators have been designed for boosting computing performance and efficiency in many application domains such as deep learning [5, 11, 12, 23, 31, 35, 64-69, 77, 87, 88], dense linear algebra [23,29,30,35,77], graph processing [4,17,25,26,39,70,89,91,92,95], genomic and bio analysis [8,9,13,14,33,38,51,76,81], data sorting [10,52,60,63], most are designed for one specific problem with fixed input and output size. For FPGA accelerators even with improved tools such as [17,77], a new design will still consume many hours or even a few days due to long synthesis and place/route time.…”

Section: Introduction (mentioning)

confidence: 99%

Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication

Song, Chi, Sohrabizadeh et al. (2022)

Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays


Sparse-Matrix Dense-Matrix multiplication (SpMM) is the key operator for a wide range of applications including scientific computing, graph processing, and deep learning. Architecting accelerators for SpMM is faced with three challenges: (1) the random memory accessing and unbalanced load in processing because of random distribution of elements in sparse matrices, (2) inefficient data handling of the large matrices which cannot fit on-chip, and (3) a non-general-purpose accelerator design where one accelerator can only process a fixed-size problem.

In this paper, we present Sextans, an accelerator for general-purpose SpMM processing. The Sextans accelerator features (1) fast random access using on-chip memory, (2) streaming access to off-chip large matrices, (3) PE-aware non-zero scheduling for balanced workload with an II=1 pipeline, and (4) hardware flexibility to enable prototyping the hardware once to support SpMMs of different size as a general-purpose accelerator. We leverage high bandwidth memory (HBM) for the efficient accessing of both sparse and dense matrices. In the evaluation, we present an FPGA prototype Sextans which is executable on a Xilinx U280 HBM FPGA board and a projected prototype Sextans-P with higher bandwidth competitive to V100 and more frequency optimization. We conduct a comprehensive evaluation on 1,400 SpMMs on a wide range of sparse matrices including 50 matrices from SNAP and 150 from SuiteSparse. We compare Sextans with NVIDIA K80 and V100 GPUs. Sextans achieves a 2.50x geomean speedup over K80 GPU and Sextans-P achieves a 1.14x geomean speedup over V100 GPU (4.94x over K80). The code is available at https://github.com/linghaosong/Sextans.
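For readers unfamiliar with the operator, the following is a minimal reference sketch of SpMM with the sparse matrix held in compressed sparse row (CSR) form. It only shows what is being computed, not how Sextans schedules or streams it; the function name `spmm_csr`, the CSR layout, and the list-of-lists dense matrix are assumptions chosen for the example.

```python
def spmm_csr(indptr, indices, data, dense):
    """Compute C = A x B where A is sparse (CSR) and B is dense.

    indptr, indices, data: standard CSR arrays for A (hypothetical inputs).
    dense: B as a list of rows; its row count must equal A's column count.
    """
    num_rows = len(indptr) - 1
    num_cols = len(dense[0]) if dense else 0
    out = [[0.0] * num_cols for _ in range(num_rows)]

    for i in range(num_rows):
        row_c = out[i]
        # Walk the non-zeros of row i of A.
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]      # column of the non-zero, selects a row of B
            a = data[k]
            row_b = dense[j]    # data-dependent (random) access into B
            for c in range(num_cols):
                row_c[c] += a * row_b[c]

    return out


# A = [[1, 0, 2],
#      [0, 3, 0]]
result = spmm_csr([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0],
                  [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

The lookup `dense[j]` depends on where the non-zeros fall, and rows with many non-zeros take longer than sparse ones; these correspond to the random-access and load-imbalance challenges listed in the abstract.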


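“…The core difficulty of the programming paradigm for regular applications is how to extend the existing stream-processing programming paradigm [58] to handle the complex dataflow memory-access behavior of regular applications, taking into account the characteristics of a dynamically reconfigurable memory system. When handling irregular applications, of which graph computing is representative, stream processing incurs wide-ranging random memory accesses that severely degrade system performance, so methods that partition the data into blocks need to be considered [59]. The core difficulty of the programming paradigm for irregular applications is how to use data partitioning to narrow the range of random memory accesses and to fully reuse the blocks through dynamic reconfiguration of the memory system.…”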

Section: Proposal of the reconfigurable computing model (unclassified)

Reconfigurable computing: toward software defined chips

Wei, Li, Jian-feng et al. (2020)

Sci. Sin.-Inf.


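specific integrated circuit, ASIC), will face excessively high non-recurring engineering (NRE) costs because their life cycles are too short. Meanwhile, as Moore's law and Dennard scaling come to an end, the energy-efficiency gains from future integrated-circuit process updates are shrinking, and the computing capability achievable by general-purpose processors is constrained by chip power consumption. Domain-specific accelerators (DSAs), which have emerged in recent years, customize the chip architecture for the computing patterns of a specific application domain, aiming to balance energy efficiency with flexibility within that domain. However, DSAs currently require software to be customized for the hardware, which fragments the software ecosystem and raises the learning cost for programmers. Future chip design needs to balance flexibility, energy efficiency, and programmability. Software-defined chips (SDCs) have become a research hotspot in response to this need. Reconfigurable chips combine the high flexibility of processors with the high energy efficiency of ASICs and, through reconfiguration, provide the ability to customize the chip architecture at run time according to the software, making them a current focus of SDC research. This paper first reviews the research motivation for SDCs, then analyzes how reconfigurable chips meet the requirements of SDCs, then discusses the challenges that reconfigurable chips currently face, and finally describes the future directions of reconfigurable chips toward realizing SDCs.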
