2015 IEEE International Parallel and Distributed Processing Symposium Workshop
DOI: 10.1109/ipdpsw.2015.101

GraphMMU: Memory Management Unit for Sparse Graph Accelerators

Abstract: Memory management units that use low-level AXI descriptor chains to hold irregular graph-oriented access sequences can improve the DRAM throughput of graph algorithms by almost an order of magnitude. For the Xilinx Zedboard, we explore and compare the memory throughputs achievable when using (1) cache-enabled CPUs with an OS, (2) cache-enabled CPUs running bare-metal code, (3) CPU-based control of FPGA-based AXI DMAs, and finally (4) local FPGA-based control of AXI DMA transfers. For short-burst irregular…

Cited by 4 publications (3 citation statements); References 7 publications.
“…We operate the MMU in optimized Scatter-Gather mode [9]. This allows the AXI DMA engine to avoid requiring frequent assistance from the CPU and enables somewhat independent operation.…”
Section: Memory Access Optimization
confidence: 99%
“…Another problem is the mismatch between the hardware processing throughput and the off-chip bandwidth. For example, Figure 3.1(a) shows a small computational kernel, the 'gradient' benchmark from the medical imaging domain [92], while Table 3.1 [93]. While this memory bottleneck could be solved using a streaming interface directly between the off-chip sensors and the FPGA fabric, or directly between an external Dynamic random-access memory (DRAM) and the FPGA fabric, it does show that time-sharing the FU among multiple operations of the kernel, using a CGRA-like TM overlay, may be a feasible alternative.…”
Section: Discussion
confidence: 99%
“…The architectural enhancements help reduce the II significantly with just a modest increase in the area overhead, thus improving the compute efficiency. Compared to the original version, the modified overlays can achieve up to 2.4× higher throughput in GOPS, 93.7% higher compute efficiency in MOPS/eSlice and a 43.7% lower latency in ns.…”
Section: Contributions
confidence: 99%