Pradeep Moorthy scite author profile

Pradeep Moorthy

4 Publications

17 Citation Statements Received

31 Citation Statements Given


Affiliations

Nanyang Technological University, Nanyang Institute of Technology, Anna University, Chennai

Publications


Zedwulf: Power-Performance Tradeoffs of a 32-Node Zynq SoC Cluster

Moorthy, Kapre (2015)


Commodity SoCs with hybrid architectures that combine CPUs with programmable FPGA fabric, such as the Xilinx Zynq SoC, have become a competitive energy-efficient platform for addressing irregular parallelism in graph problems. In this paper, we prototype a 32-node cluster composed of these Zynq SoC chips to accelerate communication-bound sparse graph-oriented applications such as neural network simulations. We develop specialized MPI routines for irregular accelerator-to-accelerator communication of small message traffic. We use the ARM processor to handle the MPI stack while offloading compute-intensive calculations to the FPGA. For graphs with 32M nodes and 32M edges, Zedwulf delivers 94 MTEPS (Million Traversed Edges Per Second), the highest throughput in our study, exceeding the x86 multi-threaded platforms by 1.2-1.4×. For this experiment, Zedwulf operates at an efficiency of 0.49 MTEPS/W when using ARM+FPGA, which is 1.2× better than using ARMv7 CPUs alone, and within 8% of the Intel Core i7-4770k platform.
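As a rough sanity check on the figures quoted in this abstract, the throughput and efficiency numbers can be combined to estimate the power draw they imply. The sketch below is illustrative only: it assumes that the 94 MTEPS and 0.49 MTEPS/W figures come from the same run and that the efficiency figure covers the whole 32-node cluster, neither of which the abstract states outright. The function and variable names are hypothetical.

```python
# Figures quoted in the Zedwulf abstract.
throughput_mteps = 94.0         # Million Traversed Edges Per Second
efficiency_mteps_per_w = 0.49   # MTEPS per watt (ARM+FPGA configuration)
node_count = 32

def implied_power_watts(throughput, efficiency):
    """Power implied by a throughput and a throughput-per-watt figure."""
    return throughput / efficiency

# Assumption: both figures describe the same whole-cluster measurement.
cluster_power_w = implied_power_watts(throughput_mteps, efficiency_mteps_per_w)
per_node_power_w = cluster_power_w / node_count

print(f"Implied cluster power: {cluster_power_w:.0f} W")
print(f"Implied per-node power: {per_node_power_w:.1f} W")
```

Under these assumptions the estimate comes to roughly 192 W for the cluster, or about 6 W per node. The paper itself should be consulted for the measured power figures.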


GraphMMU: Memory Management Unit for Sparse Graph Accelerators

Kapre, Jianglei, Bean, et al. (2015)


Memory management units that use low-level AXI descriptor chains to hold irregular graph-oriented access sequences can help improve DRAM throughput of graph algorithms by almost an order of magnitude. For the Xilinx Zedboard, we explore and compare the memory throughputs achievable when using (1) cache-enabled CPUs with an OS, (2) cache-enabled CPUs running bare metal code, (3) CPU-based control of FPGA-based AXI DMAs, and finally (4) local FPGA-based control of AXI DMA transfers. For short-burst irregular traffic generated from sparse graph access patterns, we observe a performance penalty of almost 10× due to DRAM row activations when compared to cache-friendly sequential access. When using an AXI DMA engine configured in FPGA logic and programmed in AXI register mode from the CPU, we can improve DRAM performance by as much as 2.4× over naïve random access on the CPU. In this mode, we use the host CPU to trigger DMA transfers by writing appropriate control information to the internal registers of the DMA engine. We also encode the sparse graph access patterns as locally-stored BRAM-hosted AXI descriptor chains to drive the AXI DMA engines with minimal CPU involvement in Scatter Gather mode. In this configuration, we deliver an additional 3× speedup, for a cumulative throughput improvement of 7× over a CPU-based approach using caches while running an OS to manage irregular access.
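The cumulative figure in this abstract follows from multiplying the two stage-wise speedups. The short sketch below shows the arithmetic. It assumes the 2.4× and 3× gains compose multiplicatively and are measured against the same CPU baseline, which the abstract implies but does not spell out. The variable names are hypothetical.

```python
# Speedups quoted in the GraphMMU abstract.
register_mode_speedup = 2.4    # AXI DMA in register mode vs. naive CPU random access
scatter_gather_speedup = 3.0   # additional gain from BRAM-hosted descriptor chains

# Assumption: the two gains compose multiplicatively over one baseline.
cumulative_speedup = register_mode_speedup * scatter_gather_speedup

print(f"Cumulative speedup: about {cumulative_speedup:.1f}x")
```

The product is about 7.2×, consistent with the rounded 7× quoted in the abstract.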


A Case for Embedded FPGA-based SoCs in Energy-Efficient Acceleration of Graph Problems

Kapre, Moorthy (2015), JSFI


Sparse graph problems are notoriously hard to accelerate on conventional platforms due to irregular memory access patterns that result in underutilization of memory bandwidth. These bottlenecks on traditional x86-based systems mean that sparse graph problems scale very poorly, both in terms of performance and power efficiency. A cluster of embedded SoCs (systems-on-chip) with closely-coupled FPGA accelerators can support distributed memory access with better-matched low-power processing. We first conduct preliminary experiments across a range of COTS (commercial off-the-shelf) embedded SoCs to establish their promise for energy-efficient acceleration of sparse problems. We select the Xilinx Zynq SoC with FPGA accelerators to construct a prototype 32-node Beowulf cluster. We develop specialized MPI routines and memory DMA offload engines to support irregular communication efficiently. In this setup, we use the ARM processor as a data marshaller for local DMA traffic as well as remote MPI traffic, while the FPGA may be used as a programmable accelerator. Across a set of benchmark graphs, we show that a 32-node embedded SoC cluster can exceed the energy efficiency of an Intel E5-2407 by as much as 1.7× at a total graph processing capacity of 91-95 MTEPS for graphs as large as 32 million nodes and edges.


Analysis and Implementation of Parallel Low-Complexity Motion Estimation

Subramanian, Chandrababu, Moorthy, et al. (2007)

