The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
DOI: 10.1145/3431920.3439296

FracBNN: Accurate and FPGA-Efficient Binary Neural Networks with Fractional Activations

Abstract: Binary neural networks (BNNs) have 1-bit weights and activations. Such networks are well suited for FPGAs, as their dominant computations are bitwise arithmetic and the memory requirement is also significantly reduced. However, compared to state-of-the-art compact convolutional neural network (CNN) models, BNNs tend to produce much lower accuracy on realistic datasets such as ImageNet. In addition, the input layer of BNNs has gradually become a major compute bottleneck, because it is conventionally excluded…
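As context for the abstract's claim that BNN compute is dominated by bitwise arithmetic: with {-1, +1} weights and activations bit-packed into machine words (1 encoding +1, 0 encoding -1), a dot product reduces to XOR/XNOR plus popcount via the identity dot = N - 2*popcount(w XOR a). The C sketch below is illustrative only, not code from FracBNN; the packing convention and the GCC/Clang builtin __builtin_popcountll are assumptions.

#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch (not from the paper): dot product of two
 * 64-element {-1,+1} vectors packed as bits (bit 1 -> +1, bit 0 -> -1).
 * XNOR counts agreements, so dot = 2*popcount(~(w^a)) - 64,
 * which simplifies to 64 - 2*popcount(w ^ a). */
static int binary_dot64(uint64_t w, uint64_t a) {
    /* __builtin_popcountll is a GCC/Clang builtin; on an FPGA this
     * whole expression maps to LUTs and a popcount adder tree. */
    return 64 - 2 * __builtin_popcountll(w ^ a);
}

int main(void) {
    uint64_t w = 0xF0F0F0F0F0F0F0F0ULL;
    uint64_t a = 0xFF00FF00FF00FF00ULL;
    printf("binary dot product = %d\n", binary_dot64(w, a));
    return 0;
}

Replacing each multiply-accumulate with one XNOR gate and a shared popcount tree is what makes BNN layers cheap on FPGA fabric, which is the efficiency argument the abstract opens with.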

Cited by 74 publications (40 citation statements).
References 43 publications (71 reference statements).
“…Compared with CPUs and GPUs, FPGAs have the unique capability of customizing the control flow and data paths, which has demonstrated tremendous potential in various application domains, including stencil computations [7,8,19,38], neural networks [33,52,56], and general graph algorithms [5,28,55]. This makes the FPGA a naturally good candidate platform for SSSP acceleration, since the high-throughput on-chip priority queues [2,36] enable effective control over the trade-off between parallelism and the amount of work [1,35].…”
Section: Introduction (mentioning)
confidence: 99%
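For readers unfamiliar with the SSSP trade-off this excerpt refers to, a software reference is sketched below. It is an illustrative toy, not the cited accelerator: the O(V^2) min-selection loop stands in for the high-throughput on-chip hardware priority queue, and the graph is a made-up 5-vertex example. Extracting vertices in priority order minimizes redundant edge relaxations; relaxing more speculatively would raise parallelism at the cost of extra work, which is the trade-off the quote describes.

#include <stdio.h>
#include <limits.h>

#define V 5   /* vertices (toy example) */
#define E 7   /* edges */

/* edge list: src[i] -> dst[i] with weight wgt[i] */
static const int src[E] = {0, 0, 1, 1, 2, 3, 3};
static const int dst[E] = {1, 2, 2, 3, 3, 4, 1};
static const int wgt[E] = {4, 1, 1, 5, 2, 3, 1};

int main(void) {
    int dist[V], done[V] = {0};
    for (int v = 0; v < V; v++) dist[v] = INT_MAX;
    dist[0] = 0;  /* source vertex */

    for (int iter = 0; iter < V; iter++) {
        /* min-selection stands in for the hardware priority queue */
        int u = -1;
        for (int v = 0; v < V; v++)
            if (!done[v] && dist[v] != INT_MAX && (u < 0 || dist[v] < dist[u]))
                u = v;
        if (u < 0) break;
        done[u] = 1;
        /* relax all outgoing edges of u */
        for (int e = 0; e < E; e++)
            if (src[e] == u && dist[u] + wgt[e] < dist[dst[e]])
                dist[dst[e]] = dist[u] + wgt[e];
    }
    for (int v = 0; v < V; v++)
        printf("dist[%d] = %d\n", v, dist[v]);
    return 0;
}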
“…We implement HiSparse using high-level synthesis (HLS). While recent years have seen a rapidly increasing adoption of HLS for accelerator development, a majority of existing HLS designs target dense computations, such as dense matrix multiplication [9][10][11], image/video processing [12][13][14], and convolutional neural networks [15][16][17]. Developing high-performance sparse accelerators using HLS is more challenging because the irregular compute pattern of sparse workloads causes bank conflicts and carried dependencies.…”
Section: Introduction (mentioning)
confidence: 99%
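To make the excerpt's bank-conflict point concrete, below is a minimal CSR sparse matrix-vector multiply in C (illustrative only, not the HiSparse design; the 3x3 matrix is a made-up example). The gather x[col[i]] has data-dependent addresses, so when the inner loop is pipelined in HLS, in-flight reads can collide on the same memory bank, and the accumulation into acc is a carried dependence across iterations.

#include <stdio.h>

#define ROWS 3
#define NNZ  5

/* CSR storage: rowptr[r]..rowptr[r+1] indexes row r's nonzeros */
static const int   rowptr[ROWS + 1] = {0, 2, 3, 5};
static const int   col[NNZ]         = {0, 2, 1, 0, 2};
static const float val[NNZ]         = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};

int main(void) {
    const float x[3] = {1.0f, 2.0f, 3.0f};
    float y[ROWS];
    for (int r = 0; r < ROWS; r++) {
        float acc = 0.0f;
        for (int i = rowptr[r]; i < rowptr[r + 1]; i++)
            acc += val[i] * x[col[i]];  /* irregular, data-dependent gather */
        y[r] = acc;
    }
    for (int r = 0; r < ROWS; r++)
        printf("y[%d] = %f\n", r, y[r]);
    return 0;
}

Dense kernels avoid both problems because their addresses are affine in the loop indices, which is why the excerpt notes that most existing HLS designs target dense computations.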
“…Field Programmable Gate Arrays (FPGAs) are increasingly prominent in modern heterogeneous computer systems. Specialized hardware designs provide unprecedented efficiency in domains such as machine learning [74,83,101,102,122,127,128], compression [92,125], database operations [88,96,104], graph processing [36,47,112,129], networking [41,52,111], and storage virtualization [78]. To realize the benefits of FPGAs, systems researchers have built operating systems [53,73,77,106], virtualization support [42,46,80,85,113,120,123,124], just-in-time compilers [97], and high-level synthesis tools [43,44,61,116,117].…”
Section: Introduction (mentioning)
confidence: 99%