Abstract: In this work we present a new 64-bit floating-point fused multiply-add (FMA) unit that can perform both binary and decimal addition, multiplication, and fused multiply-add operations. The presented FMA has 6% less delay than the fastest stand-alone decimal unit and 23% less area than separate binary and decimal units combined. These results were achieved by the use of: 1) column-by-column reduction to reduce the partial products in the multiplier tree, 2) a new leading zeros detector that produces its output in ba…
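To illustrate why a fused multiply-add differs from a multiply followed by a separate add, the following minimal Python sketch compares the two for binary64 operands. It is not a model of the unit described in the abstract: `fused_multiply_add` is a hypothetical helper that emulates a single final rounding using exact rational arithmetic, and the operand values are chosen only for illustration.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a*b + c with one rounding (hypothetical helper, not a hardware model)."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # exact, no intermediate rounding
    return float(exact)  # single rounding to binary64


# Illustrative operands: the exact product is 1 - 2**-60.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = a * b + c                  # the product is rounded to 1.0 before the add
fused = fused_multiply_add(a, b, c)  # the exact product survives until the final rounding

print("multiply then add:", unfused)
print("fused            :", fused)
```

In the unfused case the low-order part of the product is lost when the multiplication result is rounded, so the subsequent addition cancels to zero; the fused form keeps that part and returns a small nonzero value.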
“…LZA is pre-corrected operand to calculate the number of leading zeros. The LZA is composed of a two-vector computation followed by a leading zero detector (LZD) [13]. In order to make this fused FDP much faster, pipeline concepts are implemented, by replacing traditional ripple Double-base number system (DBNS) needs O(k log k) addition operations to perform a k-bit multiplication operation [14].…”
Fused floating-point operations play a major role in many DSP applications in reducing operational area and power consumption. A radix-2^r multiplier (using a 7-bit encoder technique) and a pipeline feedforward-cutset-free carry-lookahead adder (PFCF-CLA) are used to enhance the traditional FDP unit. Pipelining is also introduced into the system to obtain the desired pipeline fused floating-point dot product (PFFDP) operations. Synthesis results are obtained using a 60 nm standard library with a 1 GHz clock. Power consumption for single- and double-precision operations is 2.24 mW and 3.67 mW, respectively. The die areas are 27.48 mm² and 46.72 mm², with execution times of 1.91 ns and 2.07 ns, for single- and double-precision operations, respectively. A comparison with previous data has also been performed. The area-delay product (ADP) and power-delay product (PDP) of the proposed architecture are 18%, 22% and 27%, 18% for single- and double-precision operations, respectively.
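As a rough check on the figures quoted above, the sketch below computes area-delay and power-delay products from the reported area, power, and execution time. It assumes that the quoted execution time is the delay term in both products, which the abstract does not state; the dictionary layout and function names are illustrative only.

```python
# Figures as quoted in the abstract above; the keys are hypothetical names.
reported = {
    "single": {"area_mm2": 27.48, "power_mw": 2.24, "delay_ns": 1.91},
    "double": {"area_mm2": 46.72, "power_mw": 3.67, "delay_ns": 2.07},
}


def area_delay_product(entry: dict) -> float:
    """ADP in mm^2 * ns, assuming execution time is the delay term."""
    return entry["area_mm2"] * entry["delay_ns"]


def power_delay_product(entry: dict) -> float:
    """PDP in picojoules: mW * ns = 1e-3 W * 1e-9 s = 1e-12 J."""
    return entry["power_mw"] * entry["delay_ns"]


for precision, entry in reported.items():
    adp = area_delay_product(entry)
    pdp = power_delay_product(entry)
    print(f"{precision}: ADP = {adp:.2f} mm^2*ns, PDP = {pdp:.2f} pJ")
```

These are absolute products for the proposed design only; the percentages in the abstract are relative to earlier designs whose figures are not given here, so they cannot be reproduced from this data.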
“…This effect is due to the fact that a set of finite radix-10 numbers becomes periodic when represented in radix-2 notation. Wahba et al. [11] present a solution that reduces the latency of an FP decimal unit by 6% compared to SoA solutions and saves 23% of the total area compared to solutions that include two FP units (for binary and decimal support, respectively). Decimal FPUs are characterized by a longer critical path and a larger area than binary units since representing a decimal digit requires four bits.…”
Recent applications in the domain of near-sensor computing require the adoption of floating-point arithmetic to reconcile high-precision results with a wide dynamic range. In this paper, we propose a multi-core computing cluster that leverages the fine-grained tunable principles of transprecision computing to provide support to near-sensor applications at a minimum power budget. Our design, based on the open-source RISC-V architecture, combines parallelization and sub-word vectorization with near-threshold operation, leading to a highly scalable and versatile system. We perform an exhaustive exploration of the design space of the transprecision cluster on a cycle-accurate FPGA emulator, with the aim of identifying the most efficient configurations in terms of performance, energy efficiency, and area efficiency. We also provide full-fledged software stack support, including a parallel runtime and a compilation toolchain, to enable the development of end-to-end applications. We perform an experimental assessment of our design on a set of benchmarks representative of the near-sensor processing domain, complementing the timing results with a post-place-and-route analysis of the power consumption. Finally, a comparison with the state of the art shows that our solution outperforms the competitors in energy efficiency, reaching a peak of 97 Gflop/s/W on single-precision scalars and 162 Gflop/s/W on half-precision vectors.
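The citation statement above notes that finite radix-10 numbers can become periodic in radix-2. A small Python sketch of that effect follows; `binary_fraction_bits` is a hypothetical helper written for this illustration, and the standard `decimal` module stands in for decimal arithmetic in software rather than for any hardware decimal unit.

```python
from decimal import Decimal
from fractions import Fraction


def binary_fraction_bits(numerator: int, denominator: int, count: int) -> str:
    """First `count` bits after the binary point of numerator/denominator (0 <= value < 1)."""
    bits = []
    remainder = numerator
    for _ in range(count):
        remainder *= 2
        bits.append(str(remainder // denominator))
        remainder %= denominator
    return "".join(bits)


# 1/10 is finite in radix 10 but repeats the pattern 0011 forever in radix 2.
bits = binary_fraction_bits(1, 10, 12)
print("0.1 in binary: 0." + bits + "...")

# A binary64 value therefore only approximates 0.1.
stored_exactly = Fraction(0.1) == Fraction(1, 10)
binary_sum_ok = (0.1 + 0.2 == 0.3)
decimal_sum_ok = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
print(stored_exactly, binary_sum_ok, decimal_sum_ok)
```

The comparison in binary floating point fails because each operand carries a small representation error, while the decimal comparison holds because all three values are stored exactly.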
“…However, in this state-of-the-art approach [24], the multipliers and the adder tree are still two separate computation components. On the other hand, some previous multiply-accumulate (MAC) designs [25, 26, 27, 28] have tried to reduce the overheads caused by final additions of multiplications. However, since these MAC designs [25, 26, 27, 28] assume that only one multiplier is used, their approaches cannot be directly applied to the design of 2-D convolver hardware circuit.…”
Convolution operations have a significant influence on the overall performance of a convolutional neural network, especially in edge-computing hardware design. In this paper, we propose a low-power signed convolver hardware architecture that is well suited for low-power edge computing. The basic idea of the proposed convolver design is to combine all multipliers' final additions and their corresponding adder tree to form a partial product matrix (PPM) and then to use the reduction tree algorithm to reduce this PPM. As a result, compared with the state-of-the-art approach, our convolver design not only saves many carry-propagation adders but also saves one clock cycle per convolution operation. Moreover, the proposed convolver design can be adapted for different dataflows (including input stationary dataflow, weight stationary dataflow, and output stationary dataflow). According to the dataflow, two types of convolve-accumulate units are proposed to perform the accumulation of convolution results. The results show that, compared with the state-of-the-art approach, the proposed convolver design reduces power consumption by 15.6%. Furthermore, compared with the state-of-the-art approach, the proposed convolve-accumulate units reduce power consumption by 15.7% on average.
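The idea of merging every multiplier's partial products into one matrix and reducing it with a tree can be sketched in a few lines of Python. This is a simplified illustration, not the paper's architecture: it assumes unsigned operands (the paper targets signed ones), models each 3:2 compressor layer as bitwise operations on whole rows, and all function names are hypothetical.

```python
def partial_products(x: int, w: int) -> list[int]:
    """Rows of the shift-and-add partial-product matrix for unsigned x * w."""
    return [x << i for i in range(w.bit_length()) if (w >> i) & 1]


def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compression: three rows in, a sum row and a carry row out."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


def reduce_rows(rows: list[int]) -> list[int]:
    """Apply layers of 3:2 compressors until at most two rows remain."""
    rows = list(rows)
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        next_rows = []
        for i in range(0, full, 3):
            next_rows.extend(carry_save_add(*rows[i:i + 3]))
        next_rows.extend(rows[full:])  # leftover rows pass to the next layer
        rows = next_rows
    return rows


def convolve_window(pixels: list[int], weights: list[int]) -> int:
    """Dot product of one window using a single merged partial-product matrix."""
    matrix = [row for x, w in zip(pixels, weights) for row in partial_products(x, w)]
    reduced = reduce_rows(matrix)
    return sum(reduced)  # the only carry-propagating addition


pixels = [1, 2, 3, 4, 5, 6, 7, 8, 9]
weights = [9, 8, 7, 6, 5, 4, 3, 2, 1]
print(convolve_window(pixels, weights))
```

Because the rows from all nine products share one reduction tree, only the final two rows need a carry-propagating addition, rather than one such addition per multiplier followed by a separate adder tree.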