Reconfigurable co-processor architecture with limited numerical precision to accelerate deep convolutional neural networks

Wijeratne, Sasindu; Jayaweera, Sandaruwan; Dananjaya, Mahesh; Pasqual, Ajith

doi:10.1109/asap.2018.8445087

Cited by 7 publications

(2 citation statements)

References 11 publications


“…Traditional CPU designs are challenged in meeting the computational requirement of big data analytics applications. Accelerators for such applications on Field Programmable Logic Arrays (FPGAs) have become an attractive solution due to their massive parallelism, low power consumption, and cost-efficiency [1] [2]. Unfortunately, effective external DRAM memory bandwidth and access latency have become the bottleneck in such accelerators [3] [4].…”

Section: Introduction (mentioning)

confidence: 99%

Programmable FPGA-based Memory Controller

Wijeratne¹, Pattnaik², Chen³ et al. 2021

Preprint

Self Cite


Even with generational improvements in DRAM technology, memory access latency remains the major bottleneck for application accelerators, primarily due to limitations in memory interface IPs, which cannot fully account for variations in target applications, the algorithms used, and accelerator architectures. Since developing memory controllers for different applications is time-consuming, this paper introduces a modular and programmable memory controller that can be configured for different target applications on available hardware resources. The proposed memory controller efficiently supports cache-line accesses along with bulk memory transfers. The user can configure the controller depending on the available logic resources on the FPGA, memory access pattern, and external memory specifications. The modular design supports various memory access optimization techniques, including request scheduling, internal caching, and direct memory access. These techniques contribute to reducing the overall latency while maintaining high sustained bandwidth. We implement the system on a state-of-the-art FPGA and evaluate its performance using two widely studied domains: graph analytics and deep learning workloads. We show an improvement in overall memory access time of up to 58% on CNN and GCN workloads compared with commercial memory controller IPs.


“…This makes them extremely demanding in terms of silicon real estate, especially memory, as well as compute performance and power. To bring this computation closer to the edge in resource-constrained devices, recently there has been considerable interest in building special-purpose hardware accelerators to support inference [3]-[6], training [7], [8] as well as compilers to bridge the gap between software simulation and hardware acceleration [9]. However, while microarchitectural techniques have been able to improve on the efficiency of neural network processing, it is nowhere near the biological neocortex, which is not only substantially deeper and wider but is also significantly more efficient in terms of energy and data [10].…”

Section: Introduction (mentioning)

confidence: 99%

An Adaptive Memory Management Strategy Towards Energy Efficient Machine Inference in Event-Driven Neuromorphic Accelerators

Saha, Duwe, Zambreno 2019

2019 IEEE 30th International Conference on Application-Specific Systems, Architectures and Processors (ASAP)


Spiking neural networks are viable alternatives to classical neural networks for edge processing in low-power embedded and IoT devices. To reap their benefits, neuromorphic network accelerators that tend to support deep networks still have to expend great effort in fetching synaptic states from a large remote memory. Since local computation in these networks is event-driven, memory becomes the major part of the system's energy consumption. In this paper, we explore various opportunities of data reuse that can help mitigate the redundant traffic for retrieval of neuron meta-data and post-synaptic weights. We describe CyNAPSE, a baseline neural processing unit and its accompanying software simulation as a general template for exploration on various levels. We then investigate the memory access patterns of three spiking neural network benchmarks that have significantly different topology and activity. With a detailed study of locality in memory traffic, we establish the factors that hinder conventional cache management philosophies from working efficiently for these applications. To that end, we propose and evaluate a domain-specific management policy that takes advantage of the forward visibility of events in a queue-based event-driven simulation framework. Subsequently, we propose network-adaptive enhancements to make it robust to network variations. As a result, we achieve a 13-44% reduction in system power consumption and an 8-23% improvement over conventional replacement policies.
