2019
DOI: 10.48550/arxiv.1905.10830
Preprint

Feature Map Transform Coding for Energy-Efficient CNN Inference

Brian Chmiel,
Chaim Baskin,
Ron Banner
et al.

Abstract: Convolutional neural networks (CNNs) achieve state-of-the-art accuracy in a variety of tasks in computer vision and beyond. One of the major obstacles hindering the ubiquitous use of CNNs for inference on low-power edge devices is their relatively high computational complexity and memory bandwidth requirements. The latter often dominates the energy footprint on modern hardware. In this paper, we introduce a lossy transform coding approach, inspired by image and video compression, designed to reduce the memory …
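For intuition, a minimal Python sketch of the general idea described in the abstract follows: decorrelate the channels of an activation tensor with a linear transform and quantize the coefficients before writing them out. The PCA-style basis, the 4-bit step size, and the tensor shape are illustrative assumptions, not the authors' actual pipeline.

import numpy as np

# Illustrative transform coding of one activation tensor (NOT the exact
# method of the paper): channel-wise decorrelating transform + uniform
# quantization of the coefficients, then decode and measure the error.
def transform_code(fmap, bits=4):
    C, H, W = fmap.shape
    X = fmap.reshape(C, -1)                 # one row per channel
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean
    U, _, _ = np.linalg.svd(Xc @ Xc.T)      # PCA basis over channels
    Y = U.T @ Xc                            # transform coefficients
    step = (Y.max() - Y.min()) / (2 ** bits - 1)
    Yq = np.round(Y / step)                 # the lossy step
    X_hat = U @ (Yq * step) + mean          # decoder: dequantize + invert
    return X_hat.reshape(C, H, W)

fmap = np.random.randn(64, 14, 14).astype(np.float32)
rec = transform_code(fmap)
print("reconstruction MSE:", float(np.mean((fmap - rec) ** 2)))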


Cited by 2 publications (5 citation statements)
References 30 publications (35 reference statements)
“…Reducing the PE count lowers the compute bound on the roofline, but, at the same time, the use of SRAM increases operation density (i.e., moves the green dots in Figure 13 to the right), possibly within hardware capabilities. Alternative solutions for the memory-bound problem include changing the CNN architecture (for example, using a smaller number of wide layers [46]), or adding a data compression scheme on the way to and from the memory [40,41,47].…”
Section: System-level Design Methodology (mentioning)
confidence: 99%
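The roofline reasoning in the statement above can be made concrete with a short sketch: attainable throughput is the minimum of the peak compute rate and operational intensity times memory bandwidth, so compressing the data moved to and from memory shifts a memory-bound layer to the right on the roofline. All hardware and layer numbers below are hypothetical.

PEAK_OPS = 2e12    # ops/s the PE array can deliver (hypothetical)
DRAM_BW = 25e9     # bytes/s to off-chip memory (hypothetical)

def attainable(ops, bytes_moved):
    intensity = ops / bytes_moved          # "operation density", ops per byte
    return min(PEAK_OPS, intensity * DRAM_BW), intensity

ops, bytes_moved = 1e9, 4e8                # a memory-bound layer
for compression in (1.0, 2.0, 4.0):
    perf, oi = attainable(ops, bytes_moved / compression)
    print(f"{compression:.0f}x compression: {oi:5.1f} ops/B -> {perf / 1e9:6.1f} Gops/s attainable")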
“…Nonetheless, this balance can be achieved in different ways: at the micro-architecture level, at the algorithmic level, or by changing the data representation. The architect may also consider: (1) changing the hardware to provide faster communication (which requires more power and is more expensive), (2) applying communication bandwidth compression algorithms [40,41], (3) using fewer bits to represent weights and activations (using 3-or 4-bit representation may solve the communication problem, at the cost of reducing the expected accuracy), or (4) changing the algorithm to transfer the data slower (even though that solves the bandwidth issue, the possible drawback is reduced throughput of the whole system). The proposed OPS-based roofline model helps the architect to choose between alternatives.…”
Section: Roofline Analysis Examples (mentioning)
confidence: 99%
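A back-of-the-envelope example of option (3) above, lowering the bit width of the activations, for a hypothetical layer shape:

# Activation traffic of a single (made-up) feature map at different bit widths.
batch, channels, height, width = 1, 256, 56, 56
elements = batch * channels * height * width
for bits in (32, 8, 4, 3):
    print(f"{bits:>2}-bit activations: {elements * bits / 8 / 1e6:6.2f} MB per transfer")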
“…Another way to reduce memory bandwidth is by compressing the intermediate activations prior to their transfer to memory with some computationally cheap encoding, such as Huffman (Chandra, 2018; Chmiel et al., 2019) or run-length (RLE) encoding (Cavigelli et al., 2019). A similar approach of storing only nonzero values was utilized by Lin & Lai (2018).…”
Section: Related Work (mentioning)
confidence: 99%
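As a toy illustration of the cheap encodings this statement refers to, the sketch below run-length encodes the zeros of a simulated, roughly 85%-sparse post-ReLU activation vector; it is not the encoder used in any of the cited works.

import numpy as np

def rle_zeros(flat):
    """Encode a 1-D sequence as (preceding_zero_run, nonzero_value) pairs."""
    pairs, run = [], 0
    for v in flat:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append((run, 0.0))              # trailing run of zeros
    return pairs

mask = np.random.rand(1000) < 0.85             # simulate ~85% zeros
acts = np.where(mask, 0.0, np.random.rand(1000))
pairs = rle_zeros(acts.tolist())
print(f"{acts.size} values -> {len(pairs)} (run, value) pairs "
      f"({2 * len(pairs)} numbers to store)")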
“…A similar approach of storing only nonzero values was utilized by Lin & Lai (2018). (Chmiel et al., 2019)…”
Section: Related Work (mentioning)
confidence: 99%