Deep convolutional neural networks (CNNs) have achieved great success in various computer vision applications. State-of-the-art CNN models for large-scale applications are computationally intensive and memory expensive and, hence, are mainly processed on high-performance processors such as server CPUs and GPUs. However, there is an increasing demand for high-accuracy or real-time object detection in large-scale clusters and embedded systems, which calls for energy-efficient accelerators because of green-computing requirements or limited battery capacity. Owing to their energy efficiency and reconfigurability, Field-Programmable Gate Arrays (FPGAs) have been widely explored as CNN accelerators. In this article, we present an in-depth analysis of the computational complexity and memory footprint of each CNN layer type. We then propose a scalable parallel framework that exploits four levels of parallelism in hardware acceleration. We further put forward a systematic design space exploration methodology to search for the optimal solution that maximizes accelerator throughput under FPGA constraints such as on-chip memory, computational resources, external memory bandwidth, and clock frequency. Finally, we demonstrate the methodology by optimizing three representative CNNs (LeNet, AlexNet, and VGG-S) on a Xilinx VC709 board. The average performance of the three accelerators is 424.7, 445.6, and 473.4 GOP/s at a 100 MHz working frequency, which significantly outperforms the CPU and previous work.
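The sketch below illustrates the flavor of such a design space exploration under a simple roofline-style model. All budgets, layer dimensions, and cost estimates are illustrative assumptions, not the paper's actual formulation or VC709 figures; it enumerates two unroll factors and keeps the fastest configuration that fits the assumed DSP, BRAM, and bandwidth limits.

```cpp
// Minimal design-space-exploration sketch (hypothetical model, illustrative numbers only).
#include <cstdio>

int main() {
    // Assumed FPGA budgets (not the paper's VC709 figures).
    const int dsp_budget = 3600, bram_kb_budget = 6000;
    const double bw_gb_s = 12.0, freq_mhz = 100.0;   // external bandwidth (GB/s), clock (MHz)

    // Example layer: M output channels, N input channels, R x C output map, K x K kernel.
    const int M = 256, N = 192, R = 13, C = 13, K = 3;
    const double ops = 2.0 * M * N * R * C * K * K;   // each MAC counted as 2 ops

    double best_gops = 0; int best_tm = 0, best_tn = 0;
    for (int tm = 1; tm <= 64; ++tm) {                // output-channel parallelism
        for (int tn = 1; tn <= 64; ++tn) {            // input-channel parallelism
            int dsps = tm * tn;                       // one MAC unit per DSP (assumption)
            int bram_kb = (tm + tn) * 32;             // crude on-chip buffer estimate
            if (dsps > dsp_budget || bram_kb > bram_kb_budget) continue;
            double compute_s = ops / (2.0 * dsps * freq_mhz * 1e6);
            double bytes = 2.0 * (M * N * K * K + N * R * C);  // 16-bit weights + inputs (crude)
            double memory_s = bytes / (bw_gb_s * 1e9);
            double gops = ops / (compute_s > memory_s ? compute_s : memory_s) / 1e9;
            if (gops > best_gops) { best_gops = gops; best_tm = tm; best_tn = tn; }
        }
    }
    std::printf("best <Tm=%d, Tn=%d> -> %.1f GOP/s (modeled)\n", best_tm, best_tn, best_gops);
    return 0;
}
```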
Compared with baseline ultrasound, CEUS improves diagnostic performance in differentiating HCCs from non-neoplastic nodules in cirrhotic patients. Diagnosis of HCCs ≤2.0 cm in diameter by CEUS remains a clinical concern and thus needs further investigation.
Three-dimensional convolutional neural networks (3D CNNs) have gained popularity in many complicated computer vision applications. Many customized FPGA-based accelerators have been proposed for 2D CNNs, while very few target 3D CNNs. 3D CNNs are far more computationally intensive, and the design space for 3D CNN acceleration is further expanded by the additional dimension, making it a significant challenge to accelerate 3D CNNs on FPGAs. Motivated by the finding that the computation patterns of 2D and 3D CNNs are very similar, we propose a uniform architecture for accelerating both 2D and 3D CNNs in this paper. The uniform architecture is based on the idea of mapping convolutions to matrix multiplications. A customized mapping module generates the feature-matrix tiles without storing the entire enlarged feature matrix on-chip or off-chip, a splitting strategy reconstructs a convolutional layer to fit the on-chip memory capacity, and a 2D multiply-and-accumulate (MAC) array computes the matrix multiplications efficiently. For demonstration, we implement an accelerator prototype with a high-level synthesis (HLS) methodology on a Xilinx VC709 board and test the accelerator on three typical CNN models: AlexNet, VGG16, and C3D. Experimental results show that the accelerator achieves state-of-the-art throughput on both 2D and 3D CNNs, with much better energy efficiency than the CPU and GPU.
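The following is a minimal software sketch of the convolution-as-matrix-multiplication idea (im2col) for the 2D case, with hypothetical tensor layouts; the paper's mapping module produces such tiles on the fly in hardware rather than materializing the full feature matrix as done here.

```cpp
// im2col + GEMM sketch: convolution of an N-channel H x W input with M filters of size K x K
// becomes an M x (N*K*K) by (N*K*K) x (Ho*Wo) matrix multiplication. Layouts are assumptions.
#include <vector>
#include <cstddef>

// Expand the input into the (N*K*K) x (Ho*Wo) feature matrix (stride 1, no padding).
std::vector<float> im2col(const std::vector<float>& in, int N, int H, int W, int K) {
    int Ho = H - K + 1, Wo = W - K + 1;
    std::vector<float> col((size_t)N * K * K * Ho * Wo, 0.0f);
    for (int n = 0; n < N; ++n)
      for (int kh = 0; kh < K; ++kh)
        for (int kw = 0; kw < K; ++kw)
          for (int y = 0; y < Ho; ++y)
            for (int x = 0; x < Wo; ++x) {
              size_t row = ((size_t)n * K + kh) * K + kw;          // filter-element index
              size_t colIdx = (size_t)y * Wo + x;                  // output-pixel index
              col[row * (size_t)Ho * Wo + colIdx] =
                  in[((size_t)n * H + y + kh) * W + (x + kw)];
            }
    return col;
}

// Plain GEMM standing in for the 2D MAC array: out[M x P] = w[M x L] * col[L x P].
void gemm(const std::vector<float>& w, const std::vector<float>& col,
          std::vector<float>& out, int M, int L, int P) {
    for (int m = 0; m < M; ++m)
      for (int p = 0; p < P; ++p) {
        float acc = 0.0f;
        for (int l = 0; l < L; ++l) acc += w[(size_t)m * L + l] * col[(size_t)l * P + p];
        out[(size_t)m * P + p] = acc;
      }
}
```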
A theoretical analysis has been carried out to compare the tunneling processes in a double-quantum-well three-barrier (DQW-TB) system and a single-quantum-well double-barrier (SQW-DB) system. Based on a general WKB formula, it is shown that a symmetric DQW-TB system with transparency-matched barriers is far superior to the SQW-DB system in a number of aspects, including the peak current, the peak-to-valley ratio, and the speed limit.
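For reference, the textbook WKB transmission coefficient for a single barrier is shown below; the "general WKB formula" used in the paper for the full multi-barrier structure generalizes this expression and is not reproduced here.

```latex
% Standard single-barrier WKB transmission at energy E through a potential V(x),
% with classical turning points x_1, x_2 and effective mass m^*.
T \;\approx\; \exp\!\left(-2\int_{x_1}^{x_2}\sqrt{\frac{2m^{*}\bigl(V(x)-E\bigr)}{\hbar^{2}}}\;dx\right),
\qquad V(x_1)=V(x_2)=E .
```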
Speaker recognition is a crucial bio-identification technology that is extensively used in daily life. With the development of deep learning, convolutional neural networks (CNNs) have been applied to speaker recognition tasks thanks to their excellent performance. However, in practice, speaker recognition systems are frequently deployed on end devices; therefore, besides high recognition accuracy, the speaker recognition model is expected to be as simple as possible. Inspired by the 1-max pooling CNN and the Gaussian mixture model-universal background model (GMM-UBM), this study proposes a one-dimensional convolutional neural network (1D CNN) derived from the original 2D CNN. The proposed model reduces the computational complexity of ResNet20 by 64% and the number of parameters by 53%. Compared with the original ResNet20 model, recognition accuracy drops by about one percentage point on the 15 s dataset. Then, on the basis of the 1D CNN, we propose a pyramid layer-folding pipeline structure and implement it on the Xilinx VC709 platform. By partitioning along the time dimension, the proposed pyramid pipeline structure can process speech data of various lengths. Moreover, our accelerator is 5.1× faster on the 3 s dataset and 6.8× faster on the 15 s dataset than the CPU platform.
INDEX TERMS Speaker recognition, 1D convolutional neural networks, pyramid pipeline, folding pipeline, FPGA.
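A minimal sketch of the 1D-convolution-plus-1-max-pooling idea over the time axis of a feature sequence is given below; the channel counts, kernel size, and layout are hypothetical and not taken from the paper's ResNet20-derived model.

```cpp
// 1D convolution over time followed by 1-max pooling per output channel.
// in:  C_in channels x T frames (row-major); w: C_out x C_in x K kernels.
// Because the pooling collapses the time axis, the output size is independent of the
// utterance length T, which is what lets a 1-max pooling CNN handle variable-length speech.
#include <vector>
#include <algorithm>
#include <cstddef>

std::vector<float> conv1d_max(const std::vector<float>& in, const std::vector<float>& w,
                              int C_in, int T, int C_out, int K) {
    int To = T - K + 1;                                 // valid convolution length
    std::vector<float> pooled(C_out, -1e30f);
    for (int co = 0; co < C_out; ++co)
        for (int t = 0; t < To; ++t) {
            float acc = 0.0f;
            for (int ci = 0; ci < C_in; ++ci)
                for (int k = 0; k < K; ++k)
                    acc += w[((size_t)co * C_in + ci) * K + k] * in[(size_t)ci * T + t + k];
            pooled[co] = std::max(pooled[co], acc);     // 1-max pooling over time
        }
    return pooled;                                      // one value per output channel
}
```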