Abstract: Quantized low-precision neural networks are very popular because they require fewer computational resources for inference and can provide high performance, which is vital for real-time and embedded recognition systems. However, their advantages are most apparent on FPGA and ASIC devices, while general-purpose processor architectures are not always able to perform low-bit integer computations efficiently. The most frequently used low-precision neural network model for mobile central processors is an 8-bit quantized …
“…The first term of (3) represents matrix multiplication of quantized matrices: 8-bit operands with a 32-bit product in the case of gemmlowp and 4-bit operands with a 16-bit product in the case of [20]. The second and third terms do not depend on j and i, respectively, so they are easier to compute: in terms of algorithmic complexity, the first term requires O(mnk) operations, the second O(mk), the third O(nk), and the fourth O(1).…”
Section: B. Integer GeMM
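For context, a plausible reconstruction of the decomposition referred to as (3), assuming gemmlowp-style affine quantization in which the operands A (m×k) and B (k×n) carry zero points z_A and z_B (the symbols here are ours, not necessarily the paper's):

```latex
% Quantized GeMM with zero points expanded into four terms.
C_{ij} = \sum_{t=1}^{k} (A_{it} - z_A)(B_{tj} - z_B)
       = \underbrace{\sum_{t=1}^{k} A_{it} B_{tj}}_{O(mnk)}
       - z_B \underbrace{\sum_{t=1}^{k} A_{it}}_{O(mk)}
       - z_A \underbrace{\sum_{t=1}^{k} B_{tj}}_{O(nk)}
       + \underbrace{k \, z_A z_B}_{O(1)}
```

The second sum depends only on the row index i and the third only on the column index j, which matches the per-term complexities quoted in the snippet.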
“…In GeMM-based convolution, it limits the number of channels in the input feature map [20]. Let us consider a convolution with an H_k × W_k kernel.…”
Section: B. Integer GeMM
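One plausible reading of this limitation, assuming unsigned 4-bit operands accumulated in unsigned 16-bit registers as in the 4-bit scheme of [20] cited above (the exact bound used in [20] may differ):

```latex
% Worst-case product of two 4-bit unsigned values is 15 * 15 = 225, so a
% 16-bit unsigned accumulator can sum at most floor(65535 / 225) = 291
% products without overflow. In GeMM-based convolution the reduction length
% is H_k * W_k * C_in, which bounds the number of input channels:
C_{\mathrm{in}} \le \left\lfloor \frac{2^{16} - 1}{15 \cdot 15 \cdot H_k W_k} \right\rfloor,
\qquad \text{e.g. } C_{\mathrm{in}} \le 32 \text{ for a } 3 \times 3 \text{ kernel.}
```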
“…In this section, we demonstrate the efficiency of the proposed ternary (TNN), ternary-binary (TBN), and binary (BNN) matrix multiplication on ARM AArch64 CPUs and compare them to known efficient algorithms: binary from the daBNN library [22] (daBNN), 8-bit from the gemmlowp library [29] (U8), 4-bit from [20] with a microkernel upscaled to 24 × 8 (U4; the original size was 24 × 4 for the ARMv7 architecture), and our implementation of a 32-bit floating-point baseline that uses the same register layout as gemmlowp but computes operations in floating point (F32).…”
“…Widely used 8-bit quantization allows a 4-fold reduction in network size and a significant speedup on mobile CPUs while maintaining quality close to that of full-precision models [17]. 4-bit QNNs demonstrate a noticeable drop in recognition quality on challenging tasks [18], [19]; still, 4-bit quantization can be used to significantly accelerate CPU inference of small CNNs [20]. The most memory-efficient quantization is binarization: in binary QNNs (BNNs), weights and activations take only the values 1 or −1 and require a single bit for storage.…”
Section: Introduction
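To make the computational appeal of binarization concrete, here is a minimal sketch (ours, not taken from any of the cited libraries) of a binary dot product over 64 packed {−1, +1} values:

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>

// Dot product of two length-64 vectors with entries in {-1, +1}, each packed
// into a 64-bit word (a set bit encodes +1, a clear bit encodes -1).
// Matching bits contribute +1, mismatching bits contribute -1, so
// dot = (#matches) - (#mismatches) = 64 - 2 * popcount(a XOR b).
int binary_dot64(std::uint64_t a, std::uint64_t b) {
    return 64 - 2 * std::popcount(a ^ b);
}
```

BNN implementations apply this idea blockwise over SIMD registers, accumulating popcounts along the reduction dimension.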
“…In the experimental section of our work, we compare the performance of the proposed algorithms to computationally efficient matrix multiplication algorithms for different data types: 32-bit floating-point, 8-bit integer from Google's gemmlowp library [29], 4-bit presented in [20], and binary from the daBNN library [22].…”
Low-bit quantized neural networks (QNNs) are of great interest in practical applications because they significantly reduce the consumption of both memory and computational resources. Binary neural networks (BNNs) are memory- and computationally efficient, as they require only one bit per weight and activation and can be computed using Boolean logic and bit-count operations. QNNs with ternary weights and activations (TNNs) and with binary weights and ternary activations (TBNs) aim to improve recognition quality compared to BNNs while preserving low bit-width. However, their efficient implementation is usually considered only on ASICs and FPGAs, which limits their applicability in real-life tasks. At the same time, one of the areas where efficient recognition is most in demand is recognition on mobile devices using their CPUs, and there are no known fast implementations of TNNs and TBNs, only the daBNN library for BNN inference. In this paper, we propose novel fast algorithms of ternary, ternary-binary, and binary matrix multiplication for mobile devices with ARM architecture. In our algorithms, ternary weights are represented using a 2-bit encoding and binary weights using one bit. This allows us to replace matrix multiplication with Boolean logic operations that can be computed on 128 bits simultaneously using the ARM NEON SIMD extension. The matrix multiplication results are accumulated in 16-bit integer registers. We also use a special reordering of values in the left and right matrices. All of this allows us to compute a matrix product efficiently while minimizing the number of loads and stores compared to the algorithm from daBNN. Our algorithms can be used to implement inference of convolutional and fully connected layers of TNNs, TBNs, and BNNs. We evaluate them experimentally on an ARM Cortex-A73 CPU and compare their inference speed to efficient implementations of full-precision, 8-bit, and 4-bit quantized matrix multiplications. Our experiments show that our ternary and ternary-binary matrix multiplication implementations have almost the same inference time; they are 3.6 times faster than full-precision, 2.5 times faster than 8-bit quantized, and 1.4 times faster than 4-bit quantized matrix multiplication, but 2.9 times slower than binary matrix multiplication.
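The abstract does not spell out the 2-bit ternary encoding, so the following is an illustrative sketch only, assuming one natural choice of two bit planes (a nonzero mask and a sign plane); the authors' actual kernels operate on 128-bit ARM NEON registers with 16-bit accumulators and a specific data reordering that is not reproduced here:

```cpp
#include <bit>
#include <cstdint>

// A ternary vector of length 64 stored as two bit planes:
//   nonzero: bit i = 1 if t_i != 0
//   sign:    bit i = 1 if t_i == -1 (only meaningful where nonzero is set)
struct Ternary64 {
    std::uint64_t nonzero;
    std::uint64_t sign;
};

// Ternary x ternary dot product: a position contributes 0 if either operand
// is zero, +1 if both are nonzero with equal signs, -1 if the signs differ.
int ternary_dot64(Ternary64 a, Ternary64 b) {
    std::uint64_t nz  = a.nonzero & b.nonzero;   // both operands nonzero
    std::uint64_t neg = (a.sign ^ b.sign) & nz;  // nonzero and signs differ
    return std::popcount(nz) - 2 * std::popcount(neg);
}

// Ternary x binary dot product (the TBN case): the binary operand is a dense
// {-1, +1} vector encoded by its sign plane alone, so only the ternary
// operand can zero out a position.
int ternary_binary_dot64(Ternary64 a, std::uint64_t b_sign) {
    std::uint64_t neg = (a.sign ^ b_sign) & a.nonzero;
    return std::popcount(a.nonzero) - 2 * std::popcount(neg);
}
```

Both variants reduce to a handful of Boolean operations plus popcounts per 64 positions, which is consistent with the abstract's finding that the TNN and TBN kernels approach, though do not match, the speed of the purely binary kernel.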
Recently, deep neural network (DNN) acceleration has become critical for hardware systems ranging from mobile/edge devices to high-performance data centers. In particular, for on-device AI, there have been many studies on reducing hardware numerical precision given the limited hardware resources of mobile/edge devices. Although layer-wise mixed precision reduces computational complexity, it is not straightforward to find a well-balanced layer-wise precision scheme: determining the optimal precision for each layer takes a long time because of the repetitive experiments required, and model accuracy, the fundamental measure of deep learning quality, must be considered as well. In this paper, we propose a layer-wise mixed-precision scheme which can significantly reduce the time required to determine the optimal hardware numerical precision using Signal-to-Quantization-Noise Ratio (SQNR)-based analysis. In addition, the proposed scheme can take hardware complexity into consideration in terms of the number of operations (OPs) or the weight memory requirement of each layer. The proposed method can be applied directly to inference, meaning that users can utilize well-trained neural network models without the need for additional training or hardware units. With the proposed SQNR-based analysis, for the SSDlite and YOLOv2 networks, the analysis time required for layer-wise precision determination is reduced by more than 95% compared to conventional mean Average Precision (mAP)-based analysis. Also, with the proposed complexity-aware schemes, the number of OPs and the weight memory requirement can be reduced by up to 20.94% and 37.67%, respectively, for SSDlite, and by up to 96.68% and 88.53%, respectively, for YOLOv2, with negligible model accuracy degradation.
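For reference, the standard definition of SQNR that such an analysis builds on, where x denotes a layer's full-precision signal and Q_b(x) its b-bit quantized version (the paper's exact per-layer formulation may differ):

```latex
\mathrm{SQNR}_{\mathrm{dB}}(b) = 10 \log_{10}
\frac{\mathbb{E}\left[ x^{2} \right]}
     {\mathbb{E}\left[ \left( x - Q_b(x) \right)^{2} \right]}
```

Layers whose SQNR remains high at low bit-widths are candidates for aggressive quantization, while layers whose SQNR collapses are kept at higher precision.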