CNN Acceleration With Hardware-Efficient Dataflow for Super-Resolution

Lee, Sumin; Joo, Sunghwan; Ahn, Hong Keun; Jung, Seong‐Ook

doi:10.1109/access.2020.3031055

Cited by 18 publications

(23 citation statements)

References 36 publications

“…The experiments, performed on the Xilinx XC7K410T field programmable gate array (FPGA) chip, demonstrated the benefits of the proposed approach in terms of area occupancy and energy saving over several state-of-the-art counterparts. In fact, the new accelerator exhibited a logic resource requirement and a power consumption up to ~63% and ~48% lower, respectively, than previous designs [11, 13, 14, 15, 16, 17]. The adopted parallelism and the achieved 227 MHz running frequency allow the above advantages to be obtained without compromising the competitiveness of the proposed design in terms of speed performance.…”

Section: Introduction (mentioning)

confidence: 93%

“…Unfortunately, these characteristics may represent a bottleneck for those application scenarios in which real time and low power are mandatory. For this reason, designing ad-hoc hardware accelerators suitable for exploitation also within time- and power-constrained operating environments has recently received a great deal of attention [11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23]. Among the possible hardware realization platforms, FPGAs are widely recognized as powerful solutions [11, 13, 15, 17, 20] for merging the benefits from custom hardware designs, such as computational parallelism and limited energy consumption, with the strengths of software designs, including reconfigurability and short time to market.…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…While several of the existing hardware designs support both CONVs and TCONVs [11, 13, 14, 15, 16, 17, 19, 21], some of them are tailored to accomplish only TCONVs [12, 22, 23]. As an example, the FPGA accelerator proposed in our previous work [12] deals with the input-oriented method (IOM) to reduce, or completely avoid, useless operations, corresponding to multiplications by zero, introduced by the conventional zero-TCONVs’ up-sampling approach.…”

Section: Background and Related Work (mentioning)

confidence: 99%
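
The contrast drawn in the statement above, between the conventional zero-insertion TCONV and the input-oriented method (IOM), can be illustrated with a minimal 1-D sketch. This is not the cited accelerator's algorithm: the function names, the 1-D setting, and the way multiplications are counted are all assumptions made for illustration only.

```python
def tconv1d_zero_insertion(x, kernel, stride):
    """Hypothetical helper: upsample x with zeros, then run a full convolution."""
    # Place stride-1 zeros between consecutive input samples.
    upsampled = [0] * ((len(x) - 1) * stride + 1)
    for i, value in enumerate(x):
        upsampled[i * stride] = value

    out = [0] * (len(upsampled) + len(kernel) - 1)
    multiplications = 0
    for m in range(len(out)):
        for j, weight in enumerate(kernel):
            idx = m - j
            if 0 <= idx < len(upsampled):
                # Many of these products have an inserted zero as one operand.
                out[m] += upsampled[idx] * weight
                multiplications += 1
    return out, multiplications


def tconv1d_input_oriented(x, kernel, stride):
    """Hypothetical helper: scatter each input sample, scaled by the kernel."""
    out = [0] * ((len(x) - 1) * stride + len(kernel))
    multiplications = 0
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            # Every product uses a real input sample; no inserted zeros exist.
            out[i * stride + j] += value * weight
            multiplications += 1
    return out, multiplications


if __name__ == "__main__":
    x, kernel, stride = [1, 2, 3], [1, 2, 1], 2
    print(tconv1d_zero_insertion(x, kernel, stride))
    print(tconv1d_input_oriented(x, kernel, stride))
```

Both functions produce the same output samples; in this sketch the input-oriented version simply performs fewer multiplications because it never touches the inserted zeros.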

Design of Flexible Hardware Accelerators for Image Convolutions and Transposed Convolutions

2021

Nowadays, computer vision relies heavily on convolutional neural networks (CNNs) to perform complex and accurate tasks. Among them, super-resolution CNNs represent a meaningful example, due to the presence of both convolutional (CONV) and transposed convolutional (TCONV) layers. While the former exploit multiply-and-accumulate (MAC) operations to extract features of interest from incoming feature maps (fmaps), the latter perform MACs to tune the spatial resolution of the received fmaps properly. The ever-growing real-time and low-power requirements of modern computer vision applications represent a stimulus for the research community to investigate the deployment of CNNs on well-suited hardware platforms, such as field programmable gate arrays (FPGAs). FPGAs are widely recognized as valid candidates for trading off computational speed and power consumption, thanks to their flexibility and their capability to also deal with computationally intensive models. In order to reduce the number of operations to be performed, this paper presents a novel hardware-oriented algorithm able to efficiently accelerate both CONVs and TCONVs. The proposed strategy was validated by employing it within a reconfigurable hardware accelerator purposely designed to adapt itself to different operating modes set at run-time. When characterized using the Xilinx XC7K410T FPGA device, the proposed accelerator achieved a throughput of up to 2022.2 GOPS and, in comparison to state-of-the-art competitors, it reached an energy efficiency up to 2.3 times higher, without compromising the overall accuracy.

“…Then, the WCB generates the final 8-b output by combining four 4-b elements of data. The four VAs of kth kernel operation represent the product as 15 15…”

Section: B Macro Architecture (mentioning)

confidence: 99%

“…This problem is referred to as the von Neumann bottleneck or memory wall [11]. Several innovative approaches have been presented to address this issue [12]-[15].…”

Section: Introduction (mentioning)

confidence: 99%

10T SRAM Computing-in-Memory Macros for Binary and Multibit MAC Operation of DNN Edge Processors

2021

Computing-in-memory (CIM) is a promising approach to reduce latency and improve the energy efficiency of the multiply-and-accumulate (MAC) operation under a memory wall constraint for artificial intelligence (AI) edge processors. This paper proposes an approach focusing on scalable CIM designs using a new ten-transistor (10T) static random access memory (SRAM) bit-cell. Using the proposed 10T SRAM bit-cell, we present two SRAM-based CIM (SRAM-CIM) macros supporting multibit and binary MAC operations. The first design achieves fully parallel computing and high throughput using 32 parallel binary MAC operations. Advanced circuit techniques such as an input-dependent dynamic reference generator and an input-boosted sense amplifier are presented. Fabricated in a 28 nm CMOS process, this design achieves 409.6 GOPS throughput, 1001.7 TOPS/W energy efficiency, and a 169.9 TOPS/mm² throughput area efficiency. The proposed approach effectively solves previous problems such as writing disturb, throughput, and the power consumption of an analog-to-digital converter (ADC). The second design supports multibit MAC operation (4-b weight, 4-b input, and 8-b output) to increase the inference accuracy. We propose an architecture that divides the 4-b weight and 4-b input multiplication into four 2-b multiplications in parallel, which increases the signal margin by 16× compared to conventional 4-b multiplication. Besides, the capacitive digital-to-analog converter (CDAC) area issue is effectively addressed using the intrinsic bit-line capacitance existing in the SRAM-CIM architecture. The proposed approach of realizing four parallel 2-b multiplications using the CDAC is successfully demonstrated with a modified LeNet-5 neural network. These results demonstrate that the proposed 10T bit-cell is promising for realizing robust and scalable SRAM-CIM designs, which is essential for realizing fully parallel edge computing.

Index terms: computing-in-memory, static random access memory, deep neural network, machine learning, edge processor.
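
The arithmetic behind splitting a 4-b by 4-b multiplication into four 2-b multiplications can be sketched in a few lines. This is only the digital identity, assuming unsigned operands; it does not model the analog CDAC-based circuit the abstract describes, and the function names are hypothetical.

```python
def split_2b(value):
    """Split an unsigned 4-bit value into (high 2 bits, low 2 bits)."""
    if not 0 <= value <= 15:
        raise ValueError("expected an unsigned 4-bit value (0..15)")
    return value >> 2, value & 0b11


def mul_4b_via_2b(weight, activation):
    """Multiply two unsigned 4-bit values using four 2-b x 2-b partial products."""
    w_hi, w_lo = split_2b(weight)
    x_hi, x_lo = split_2b(activation)

    # Each partial product is at most 3 * 3 = 9, so it fits in 4 bits.
    hi_hi = w_hi * x_hi
    hi_lo = w_hi * x_lo
    lo_hi = w_lo * x_hi
    lo_lo = w_lo * x_lo

    # Recombine with binary weights 16, 4, 4, 1; the result fits in 8 bits.
    return (hi_hi << 4) + (hi_lo << 2) + (lo_hi << 2) + lo_lo


if __name__ == "__main__":
    print(mul_4b_via_2b(15, 15))
```

Because each partial product fits in 4 bits and the maximum result, 15 × 15 = 225, fits in 8 bits, the sketch is consistent with the 8-b output and the four 4-b elements mentioned in the citation statement above.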

Medical image super-resolution

Al-Olofi, Rushdi

2024

Artificial Intelligence and Image Processing in Medical Imaging
