2015 IEEE International Parallel and Distributed Processing Symposium Workshop
DOI: 10.1109/ipdpsw.2015.101

GraphMMU: Memory Management Unit for Sparse Graph Accelerators

Abstract: Memory management units that use low-level AXI descriptor chains to hold irregular graph-oriented access sequences can improve the DRAM throughput of graph algorithms by almost an order of magnitude. For the Xilinx Zedboard, we explore and compare the memory throughputs achievable when using (1) cache-enabled CPUs with an OS, (2) cache-enabled CPUs running bare-metal code, (3) CPU-based control of FPGA-based AXI DMAs, and finally (4) local FPGA-based control of AXI DMA transfers. For short-burst irregular…

Cited by 4 publications (3 citation statements); References 7 publications.
“…We operate the MMU in optimized Scatter-Gather mode [9]. This allows the AXI DMA engine to avoid requiring frequent assistance from the CPU and enables somewhat independent operation.…”
Section: Memory Access Optimization
confidence: 99%
“…Another problem is the mismatch between the hardware processing throughput and the off-chip bandwidth. For example, Figure 3.1(a) shows a small computational kernel, the 'gradient' benchmark from the medical imaging domain [92], while Table 3.1 [93]. While this memory bottleneck could be solved using a streaming interface directly between the off-chip sensors and the FPGA fabric, or directly between an external Dynamic random-access memory (DRAM) and the FPGA fabric, it does show that time-sharing the FU among multiple operations of the kernel, using a CGRA-like TM overlay, may be a feasible alternative.…”
Section: Discussion
confidence: 99%
“…The architectural enhancements help reduce the II significantly with just a modest increase in the area overhead, thus improving the compute efficiency. Compared to the original version, the modified overlays can achieve up to 2.4× higher throughput in GOPS, 93.7% higher compute efficiency in MOPS/eSlice and a 43.7% lower latency in ns.…”
Section: Contributions
confidence: 99%