The deep convolutional neural network (DCNN) is a class of machine learning algorithms based on feed-forward artificial neural networks and is widely used in image processing applications. Deploying DCNNs on real-world problems demands high computational power and high memory bandwidth in a power-constrained environment. A general-purpose CPU cannot exploit the different forms of parallelism these algorithms offer and is therefore too slow and energy inefficient for practical use. We propose a field-programmable gate array (FPGA)-based, runtime-programmable coprocessor to accelerate the feed-forward computation of DCNNs. The coprocessor can be programmed for a new network architecture at runtime without resynthesizing the FPGA hardware, and hence acts as a plug-and-use peripheral for the host computer. Input features and filter weights are cached in on-chip memory to reduce the external memory bandwidth requirement. Data are prefetched at several stages to avoid stalling the computational units, and several optimization techniques are used to reuse the fetched data efficiently. The dataflow is adjusted dynamically at runtime for each DCNN layer to sustain consistent computational throughput across a wide range of input feature sizes and filter sizes. The coprocessor is prototyped on a Xilinx Virtex-7 XC7VX485T FPGA-based VC707 board and operates at 150 MHz. Experimental results show that our implementation is more energy efficient than a highly optimized CPU implementation and achieves a consistent computational throughput of more than 140 G operations/s across a wide range of input feature sizes and filter sizes. Off-chip memory transactions also decrease due to the use of the on-chip cache.
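The feed-forward computation that such a coprocessor accelerates is dominated by the multiply-accumulate loops of convolution layers. The following is a minimal sketch of that recurrence in plain Python; the loop structure and names are illustrative (the abstract does not describe the coprocessor's exact dataflow), but the inner-loop reuse of input features and filter weights is what the on-chip caching exploits.

```python
def conv2d(inputs, weights):
    """Direct 2-D convolution: inputs is [C][H][W], weights is [M][C][K][K].
    Returns [M][H-K+1][W-K+1] output feature maps (no padding, stride 1)."""
    C, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    M, K = len(weights), len(weights[0][0])
    out_h, out_w = H - K + 1, W - K + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(M)]
    for m in range(M):                     # output feature maps
        for y in range(out_h):
            for x in range(out_w):
                acc = 0.0
                for c in range(C):         # input channels (cacheable on-chip)
                    for ky in range(K):
                        for kx in range(K):
                            acc += inputs[c][y + ky][x + kx] * weights[m][c][ky][kx]
                out[m][y][x] = acc
    return out
```

Each input pixel and each weight is read many times across the loop nest, which is why caching both in on-chip memory cuts external bandwidth so effectively.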
Floorplanning, as an early stage of the physical design flow, has been extensively studied in the literature and has developed into several branches. Recently, hierarchical floorplanning has been regaining attention due to the rising scale of systems-on-chip, which necessarily requires divide-and-conquer strategies to handle the increasing complexity. This paper introduces a floorplanning scheme targeting hierarchical physical prototyping, answering some of the questions posed by Kahng [8] on classical floorplanning. Our scheme emphasizes practical requirements including runtime scalability, wire length, and shape quality. We formulate a new hierarchical floorplanning problem with reduced computational complexity, but without weakening the problem as a global layout optimization. To achieve this goal, a placement seed is taken as input and converted into a slicing floorplan under the given constraints of region area and aspect ratio (region shape). We solve the problem by devising an efficient slicing algorithm with integrated dynamic programming. Implementation of the algorithm shows fast runtime and good quality of results.
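The core of dynamic programming over slicing floorplans is the shape-curve combination step: each subtree keeps a list of feasible (width, height) candidates, and a vertical or horizontal cut merges two children's lists and prunes dominated points. The sketch below shows that classical step; the function names and encoding are hypothetical and are not taken from the paper's algorithm.

```python
def combine(shapes_a, shapes_b, vertical=True):
    """Merge two children's (width, height) candidate lists under one cut."""
    merged = []
    for wa, ha in shapes_a:
        for wb, hb in shapes_b:
            if vertical:   # side by side: widths add, heights take the max
                merged.append((wa + wb, max(ha, hb)))
            else:          # stacked: heights add, widths take the max
                merged.append((max(wa, wb), ha + hb))
    return prune(merged)

def prune(shapes):
    """Keep only non-dominated (width, height) points (the shape curve)."""
    shapes = sorted(set(shapes))            # ascending width
    kept, best_h = [], float("inf")
    for w, h in shapes:
        if h < best_h:                      # strictly better height survives
            kept.append((w, h))
            best_h = h
    return kept
```

Pruning keeps the candidate lists small as the recursion moves up the slicing tree, which is what makes the dynamic program efficient in practice.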
Probabilistic reasoning is an essential tool for robust decision-making systems because of its ability to explicitly handle real-world uncertainty, constraints, and causal relations. Consequently, researchers are developing hybrid models that combine deep learning with probabilistic reasoning for safety-critical applications such as self-driving vehicles and autonomous drones. However, probabilistic reasoning kernels do not execute efficiently on CPUs or GPUs. This paper therefore proposes a custom programmable processor to accelerate sum-product networks, an important probabilistic reasoning execution kernel. The processor has a datapath architecture and memory hierarchy optimized for sum-product network execution. Experimental results show that the processor, while requiring fewer computational and memory units, achieves a 12x throughput benefit over the Nvidia Jetson TX2 embedded GPU platform.
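A sum-product network is a DAG of sum and product nodes over probabilistic leaves, and inference is a single bottom-up pass. The minimal sketch below shows that recurrence; the node encoding is hypothetical (a real SPN kernel, as targeted by the processor, evaluates the same recurrence over much larger graphs).

```python
def eval_spn(nodes, leaf_values):
    """Evaluate a sum-product network given in topological order.

    nodes: list of ('leaf', leaf_index),
           ('sum', [(child_index, weight), ...]), or
           ('prod', [child_index, ...]).
    Returns the value of every node; the last entry is the root."""
    vals = []
    for kind, arg in nodes:
        if kind == "leaf":
            vals.append(leaf_values[arg])
        elif kind == "sum":       # weighted mixture of children
            vals.append(sum(vals[c] * w for c, w in arg))
        else:                     # product over independent scopes
            v = 1.0
            for c in arg:
                v *= vals[c]
            vals.append(v)
    return vals
```

The workload is a long chain of multiply-accumulate operations with irregular, graph-dependent memory accesses, which is exactly the access pattern that a custom datapath and memory hierarchy can serve better than a general-purpose GPU.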
Twenty years ago, a paper [1] detailing a novel neural network called ADAM, which could be implemented directly in hardware RAM, was published in the first conference of this series. Subsequent research based directly or indirectly on this type of RAM-based neural network gave rise to a research group that has produced over 200 research documents. This paper overviews that research and goes on to mathematically define a CMML, a generalised version of a CMM (the component at the heart of ADAM). The CMML can be trained to replicate the exact computational properties of a CMM and so is a plug-and-play replacement for a CMM, whilst a different training algorithm gives it different properties when used in recall.
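A correlation matrix memory, the associative component at the heart of ADAM, is simple to state: training ORs the outer product of a binary input/output pair into a weight matrix, and recall sums the rows selected by the input and thresholds the result. The toy sketch below illustrates that behaviour; it is a generic binary CMM, not the paper's CMML generalisation.

```python
def train(M, x, y):
    """OR the outer product of binary vectors x (input) and y (output) into M."""
    for i, xi in enumerate(x):
        if xi:
            for j, yj in enumerate(y):
                if yj:
                    M[i][j] = 1

def recall(M, x, threshold):
    """Sum the rows of M selected by x; threshold to a binary output vector."""
    sums = [sum(M[i][j] for i, xi in enumerate(x) if xi)
            for j in range(len(M[0]))]
    return [1 if s >= threshold else 0 for s in sums]
```

Because both training and recall reduce to bitwise operations on a binary matrix, a CMM maps directly onto hardware RAM, which is what made ADAM attractive for hardware implementation.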