184QPS/W 64Mb/mm<sup>2</sup> 3D Logic-to-DRAM Hybrid Bonding with Process-Near-Memory Engine for Recommendation System

Niu, Dimin; Li, Shuangchen; Wang, Yuhao; Han, Wei; Zhang, Zhe; Guan, Yijin; Guan, Tianchan; Sun, Fengrui; Xue, Fei; Duan, Lide; Fang, Yuanwei; Zheng, Hongzhong; Jiang, Xiping; Wang, Song; Zuo, Fengguo; Wang, Yubing; Yu, Bing; Ren, Qiwei; Xie, Yuan

doi:10.1109/isscc42614.2022.9731694

Cited by 20 publications

(13 citation statements)

References 5 publications


“…Additionally, an optional FP32 (application demands high precision) general matrix-multiplication engine (GEMM) [55] and an optional vector processing unit (VPU) [64] can be added to the design. Although FPGA's FP32 TFlops is not competitive with GPU or even CPU, GEMM/VPU might be useful in latency-sensitive inference tasks with simpler model, in which case data movement from FPGA to local or remote GPU can be eliminated.…”

Section: Access Engine (mentioning)

confidence: 99%

Hyperscale FPGA-as-a-service architecture for large-scale distributed graph neural network

Niu, Wang, et al. 2022

Proceedings of the 49th Annual International Symposium on Computer Architecture

Self Cite


Graph neural network (GNN) is a promising emerging application for link prediction, recommendation, etc. Existing hardware innovation is limited to single-machine GNN (SM-GNN); however, enterprises usually adopt huge graphs with large-scale distributed GNN (LSD-GNN) that has to be carried out with distributed in-memory storage. The LSD-GNN is very different from SM-GNN in terms of system architecture demand, workflow and operators, and hence characterizations. In this paper, we first quantitatively characterize the LSD-GNN with industrial-grade framework and application, and summarize that its challenges lie in graph sampling, including distributed graph access, long latency, and underutilized communication and memory bandwidth. These challenges are missing from previous SM-GNN targeted research. We then propose a customized hardware architecture to solve the challenges, including a fully pipelined access engine architecture for graph access and sampling, a low-latency and bandwidth-efficient customized memory-over-fabric hardware, and a RISC-V centric control system providing good programmability. We implement the proposed architecture with full software support in a 4-card FPGA heterogeneous proof-of-concept (PoC) system. Based on the measurement result from the FPGA PoC, we demonstrate a single FPGA can provide up to 894 vCPU's sampling capability. With the goal of being profitable, programmable, and scalable, we further integrate the architecture to FPGA cloud (FaaS) at hyperscale, along with the industrial software framework. We explicitly explore eight FaaS architectures that carry out the proposed accelerator hardware. We finally conclude that off-the-shelf FaaS.base can already provide 2.47× performance per dollar improvement with our hardware. With architecture optimizations, FaaS.comm-opt with customized FPGA fabrics pushes the benefit to 7.78×, and FaaS.mem-opt with FPGA local DRAM and high-speed links to GPU further unleashes the benefit to 12.58×.


“…We anticipate consumer use-cases to continue diversifying, making affordable-yet-flexible DRAM increasingly important. Ambitious initiatives such as DRAM-system codesign [87,117,118,241,242] and emerging, non-traditional DRAM architectures [119,198,241,326,327,357-362] will naturally engender transparency by tightening the relationship between DRAM manufacturers and system designers. Regardless of the underlying motivation, we believe that increased transparency of DRAM reliability characteristics will remain crucial to allowing system designers to make the best use of commodity DRAM chips by enabling them to customize DRAM chips for system-level goals.…”

Section: Alternative Futures (mentioning)

confidence: 99%

A Case for Transparent Reliability in DRAM Systems

Patel, Shahroodi, Manglik, et al. 2022

Preprint


Mass-produced commodity DRAM is the preferred choice of main memory for a broad range of computing systems due to its favorable cost-per-bit. However, today's systems have diverse system-specific needs (e.g., performance, energy, reliability) that are difficult to address using one-size-fits-all general-purpose DRAM. Unfortunately, although system designers can theoretically adapt commodity DRAM chips to meet their particular design goals (e.g., by exploiting slack in access timings to improve performance, or implementing system-level RowHammer mitigations), we observe that designers today lack the necessary insight into commodity DRAM chips' reliability characteristics to implement these techniques in practice. In this work, we make a case for DRAM manufacturers to provide increased transparency into simple device characteristics (e.g., internal row address mapping, cell array organization) that affect consumer-visible reliability. Doing so has negligible impact on manufacturers given that these characteristics can be reverse-engineered using known techniques; however, it has significant benefit for system designers, who can then make informed decisions to better adapt commodity DRAM to meet modern systems' needs while preserving its cost advantages. To support our argument, we study four ways that system designers can adapt commodity DRAM chips to system-specific design goals: (1) improving DRAM reliability; (2) reducing DRAM refresh overheads; (3) reducing DRAM access latency; and (4) defending against RowHammer attacks. We observe that adopting solutions for any of the four goals requires system designers to make assumptions about a DRAM chip's reliability characteristics. These assumptions discourage system designers from using such solutions in practice due to the difficulty of both making and relying upon the assumption. We identify DRAM standards as the root of the problem: current standards rigidly enforce a fixed operating point with no specifications for how a system designer might explore alternative operating points. To overcome this problem, we introduce a two-step approach that reevaluates DRAM standards with a focus on transparency of reliability characteristics so that system designers are encouraged to make the most of commodity DRAM technology for both current and future DRAM chips.


“…Many works from academia [2, 10-12, 15-23, 25, 31, 35-39, 48, 81-83, 85, 86, 90, 99, 104-112] and industry [34, 41-43, 50-54] have shown the benefits of PnM and PuM for a wide range of workloads from different domains. However, fully adopting PIM in commercial systems is still very challenging due to the lack of tools and system support for PIM architectures across the computer architecture stack [4], which includes: (i) workload characterization methodologies and benchmark suites targeting PIM architectures; (ii) frameworks that can facilitate the implementation of complex operations and algorithms using the underlying PIM primitives (e.g., simple PIM arithmetic operations [19], bulk bitwise Boolean in-DRAM operations [83,84,92]); (iii) compiler support and compiler optimizations targeting PIM architectures; (iv) operating system support for PIM-aware virtual memory, memory management, data allocation and mapping; and (v) efficient data coherence and consistency mechanisms.…”

Section: Motivation and Problem (mentioning)

confidence: 99%

Methodologies, Workloads, and Tools for Processing-in-Memory: Enabling the Adoption of Data-Centric Architectures

F., Gómez-Luna, Ghose, et al. 2022

Preprint
