Caffeine: Toward Uniformed Representation and Acceleration for Deep Convolutional Neural Networks

Zhang, Chen; Sun, Guangyu; Fang, Zhenman; Zhou, Peipei; Pan, Peichen; Cong, Jason

doi:10.1109/tcad.2017.2785257

Cited by 271 publications

(280 citation statements)

References 30 publications

“…NeuFlow by Farabet et al [46] is one of the first works that tackled the problem of using FPGAs for DL, in particular, for vision systems. Caffeine by Zhang et al [201] is a hardware and software co-designed library to support CNNs on FPGAs. On the hardware side, it provides a high-level synthesis implementation of an FPGA accelerator for CNNs.…”

Section: Infrastructure (mentioning)

confidence: 99%

Scalable Deep Learning on Distributed Infrastructures

2020

Deep Learning (DL) has had an immense success in the recent past, leading to state-of-the-art results in various domains such as image recognition and natural language processing. One of the reasons for this success is the increasing size of DL models and the proliferation of vast amounts of training data being available. To keep on improving the performance of DL, increasing the scalability of DL systems is necessary. In this survey, we perform a broad and thorough investigation on challenges, techniques and tools for scalable DL on distributed infrastructures. This incorporates infrastructures for DL, methods for parallel DL training, multi-tenant resource scheduling and the management of training and model data. Further, we analyze and compare 11 current open-source DL frameworks and tools and investigate which of the techniques are commonly implemented in practice. Finally, we highlight future research trends in DL systems that deserve further research.

One of the driving factors of the success of DL is the scale of training in three dimensions. The first dimension of scale is the size and complexity of the models themselves. Starting from simple, shallow neural networks, with increasing depth and more sophisticated model architectures, new breakthroughs in model accuracy were achieved [30,38]. The second dimension of scale is the amount of training data. The model accuracy can, to a large extent, be improved by feeding more training data into the model [56,63]. In practice, it is reported that tens to hundreds of terabytes (TB) of training data are used in the training of a DL model [27,62]. The third dimension is the scale of the infrastructure. The availability of programmable highly-parallel hardware, especially graphics processing units (GPUs), is a key enabler to training large models with a lot of training data in a short time [30,206].

Our survey is focused on challenges that arise when managing a large, distributed infrastructure for DL. Hosting a large number of DL models that are trained with large amounts of training data is challenging. This includes questions of parallelization, resource scheduling and elasticity, data management and portability. This field is now in rapid development, with contributions from diverse research communities such as distributed and networked systems, data management, and machine learning. At the same time, we see a number of open source DL frameworks and orchestration systems emerging [4,24,141,195]. In this survey, we bring together, classify and compare the huge body of work on distributed infrastructures for DL from the different communities that contribute to this area. Furthermore, we provide an overview and comparison of the existing open-source DL frameworks and tools that put distributed DL into practice. Finally, we highlight and discuss open research challenges in this field.

Complementary Surveys

There are a number of surveys on DL that are complementary to ours. Deng [41] provides a general survey on DL architectures, algorithms and applications. LeCun et al. pro...

“…DNNBuilder [16] and FP-DNN [28] propose end-to-end tools that can automatically generate optimized FPGA-based accelerators from high-level DNN symbolic descriptions in Caffe/Tensorflow frameworks. Caffeine [27] is another automation tool that provides guidelines for choosing FPGA hardware parameters, such as the number of processing elements (PEs), bit precision of variables, and parallel data factors. By using these automation tools, it is easier to bridge the gap between fast DNN construction in popular machine learning frameworks and slow implementation of targeted hardware accelerators.…”

Section: Background and Related Work (mentioning)

confidence: 99%

AutoDNNchip

Zhang, Hao, et al. 2020

Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Recent breakthroughs in Deep Neural Networks (DNNs) have fueled a growing demand for domain-specific hardware accelerators (i.e., DNN chips). However, designing DNN chips is non-trivial because: (1) mainstream DNNs have millions of parameters and operations; (2) the design space is large due to the numerous design choices of dataflows, processing elements, memory hierarchy, etc.; and (3) an algorithm/hardware co-design is needed to allow the same DNN functionality to have a different decomposition, which would require different hardware IPs that correspond to dramatically different performance/energy/area tradeoffs. Therefore, DNN chips often take months to years to design and require a large team of cross-disciplinary experts. To enable fast and effective DNN chip design, we propose AutoDNNchip, a DNN chip generator that can automatically generate both FPGA- and ASIC-based DNN chip implementation (i.e., synthesizable RTL code with optimized algorithm-to-hardware mapping (i.e., dataflow)) given DNNs from machine learning frameworks (e.g., PyTorch) for a designated application and dataset without humans in the loop. Specifically, AutoDNNchip consists of two integrated enablers: (1) a Chip Predictor, built on top of a graph-based accelerator representation, which can accurately and efficiently predict a DNN accelerator's energy, throughput, latency, and area based on the DNN model parameters, hardware configuration, technology-based IPs, and platform constraints; and (2) a Chip Builder, which can automatically explore the design space of DNN chips (including IP selection, block configuration, resource balance, etc.), optimize chip design via the Chip Predictor, and then generate synthesizable RTL code with optimized dataflows to achieve the target design metrics. Experimental results show that our Chip Predictor's predicted performance differs from real-measured ones by <10% when validated using 15 DNN models and 4 platforms (edge-FPGA/TPU/GPU and ASIC). Furthermore, both the FPGA- and ASIC-based DNN accelerators generated by our AutoDNNchip can achieve better (up to 3.86× improvement) performance than that of expert-crafted state-of-the-art accelerators, showing the effectiveness of AutoDNNchip. Our open-source code can be found at https://github.com/RICE-EIC/AutoDNNchip.git.

“…[17] and [18] used an OpenCL-designed architecture to accelerate AlexNet and VGG on Arria 10. A reusable CNN engine with a unified framework and a scalable PE array was proposed in [19], which provided an end-to-end solution for deploying CNN models from Caffe onto an FPGA. The motivation matches well with the gap between deep learning researchers and hardware, but there is still space to improve the performance and resource utilization.…”

Section: Related Work (mentioning)

confidence: 99%

A Data-Center FPGA Acceleration Platform for Convolutional Neural Networks

Gao, Wang, Miao, et al. 2019

2019 29th International Conference on Field Programmable Logic and Applications (FPL)

Intensive computation is entering data centers with multiple workloads of deep learning. To balance the compute efficiency, performance, and total cost of ownership (TCO), the use of a field-programmable gate array (FPGA) with reconfigurable logic provides an acceptable acceleration capacity and is compatible with diverse computation-sensitive tasks in the cloud. In this paper, we develop an FPGA acceleration platform that leverages a unified framework architecture for general-purpose convolutional neural network (CNN) inference acceleration at a data center. To overcome the computation bound, 4,096 DSPs are assembled and shaped as supertile units (SUs) for different types of convolution, which provide up to 4.2 TOP/s 16-bit fixed-point performance at 500 MHz. The interleaved-task-dispatching method is proposed to map the computation across the SUs, and the memory bound is solved by a dispatching-assembling buffering model and broadcast caches. For various non-convolution operators, a filter processing unit is designed for general-purpose filter-like/pointwise operators. In the experiment, the performances of CNN models running on server-class CPUs, a GPU, and an FPGA are compared. The results show that our design achieves the best FPGA peak performance and a throughput at the same level as that of the state-of-the-art GPU in data centers, with more than 50 times lower latency.
