Area Efficient and Fast Combined Binary/Decimal Floating Point Fused Multiply Add Unit

Wahba, Ahmed A.; Fahmy, Hossam A. H.

doi:10.1109/tc.2016.2584067

Cited by 15 publications

(9 citation statements)

References 18 publications

(35 reference statements)

“…LZA is pre-corrected operand to calculate the number of leading zeros. LZA is composed of two vectors computation followed by leading zero detector(LZD) [13]. In order to make this fused FDP much faster pipeline concepts are implemented, by replacing traditional ripple Double based number system(DBNS) needs O( k logk ) addition operations to perform k-bit multiplication operation [14].…”

Section: Proposed Architectures (mentioning)

confidence: 99%

Area and Power Efficient Fused Floating-point Dot Product Unit based on Radix-2r Multiplier & Pipeline Feedforward-Cutset-Free Carry-Lookahead Adder

Babu

2021

ITII

Fused floating-point operations play a major role in many DSP applications by reducing operational area and power consumption. A radix-2r multiplier (using a 7-bit encoder technique) and a pipeline feedforward-cutset-free carry-lookahead adder (PFCF-CLA) are used to enhance the traditional FDP unit. Pipelining is also introduced into the system to obtain the desired pipelined fused floating-point dot product (PFFDP) operations. Synthesis results are obtained using a 60 nm standard library with a 1 GHz clock. Power consumption for single- and double-precision operations is 2.24 mW and 3.67 mW, respectively. The die areas are 27.48 mm² and 46.72 mm², with execution times of 1.91 ns and 2.07 ns, for single- and double-precision operations respectively. A comparison with previous data has also been performed. The area-delay product (ADP) and power-delay product (PDP) of our proposed architecture are 18%, 22% and 27%, 18% for single- and double-precision operations respectively.
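As a rough illustration of the two figures of merit named in this abstract, the sketch below multiplies the reported area, power, and delay values to get raw area-delay and power-delay products. The function names are hypothetical, and the abstract does not say how its own percentages were derived or against which baseline, so this shows only the standard definitions (ADP = area × delay, PDP = power × delay), not the authors' calculation.

```python
def area_delay_product(area_mm2: float, delay_ns: float) -> float:
    """Area-delay product in mm^2 * ns (standard definition, assumed here)."""
    return area_mm2 * delay_ns


def power_delay_product(power_mw: float, delay_ns: float) -> float:
    """Power-delay product in picojoules, since 1 mW * 1 ns = 1 pJ."""
    return power_mw * delay_ns


# Figures as reported in the abstract: (area mm^2, power mW, delay ns).
reported = {
    "single": (27.48, 2.24, 1.91),
    "double": (46.72, 3.67, 2.07),
}

for precision, (area, power, delay) in reported.items():
    adp = area_delay_product(area, delay)
    pdp = power_delay_product(power, delay)
    print(f"{precision}: ADP = {adp:.2f} mm^2*ns, PDP = {pdp:.2f} pJ")
```

Lower values are better for both metrics, which is why they are commonly used to compare designs that trade speed against area or energy.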

“…This effect is due to the fact that a set of finite radix-10 numbers becomes periodic when represented in radix-2 notation. Wahba et al [11] present a solution reducing by 6% percent the latency of an FP decimal unit compared to SoA solutions, and saving 23% of the total area compared to solutions that include two FP units (for binary and decimal support, respectively). Decimal FPUs are characterized by a longer critical path and a larger area than binary units since representing a decimal digit requires four bits.…”

Section: Alternative Formats (mentioning)

confidence: 99%
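The two points in the statement above can be checked with a short Python sketch: a finite decimal fraction such as 0.1 has a repeating expansion in radix 2, and a binary-coded decimal (BCD) digit occupies four bits. The helper names below are hypothetical, and plain 8421 BCD is assumed purely for illustration; the quoted passage does not say which decimal encoding the cited hardware uses.

```python
from decimal import Decimal
from fractions import Fraction


def binary_fraction_digits(value: Fraction, n: int) -> str:
    """Return the first n binary digits after the point for 0 <= value < 1."""
    digits = []
    for _ in range(n):
        value *= 2
        bit = int(value >= 1)
        digits.append(str(bit))
        value -= bit
    return "".join(digits)


def to_bcd(number: int) -> str:
    """Encode a non-negative integer as 8421 BCD, four bits per decimal digit."""
    return " ".join(format(int(d), "04b") for d in str(number))


# 1/10 is finite in radix 10 but repeats the pattern 0011 in radix 2.
print(binary_fraction_digits(Fraction(1, 10), 20))

# Binary floating point rounds 0.1 and 0.2; decimal arithmetic keeps them exact.
print(0.1 + 0.2 == 0.3)
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

# BCD spends 16 bits on 2016, while pure binary needs only 11.
print(to_bcd(2016), (2016).bit_length())
```

The storage overhead in the last line is one way to see why a decimal datapath tends to be wider than a binary one for the same numeric range.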

A transprecision floating-point cluster for efficient near-sensor data analytics

Montagna

Mach

Benatti

et al.

2020

Preprint

Recent applications in the domain of near-sensor computing require the adoption of floating-point arithmetic to reconcile high-precision results with a wide dynamic range. In this paper, we propose a multi-core computing cluster that leverages the fine-grained tunable principles of transprecision computing to provide support to near-sensor applications at a minimum power budget. Our design, based on the open-source RISC-V architecture, combines parallelization and sub-word vectorization with near-threshold operation, leading to a highly scalable and versatile system. We perform an exhaustive exploration of the design space of the transprecision cluster on a cycle-accurate FPGA emulator, with the aim of identifying the most efficient configurations in terms of performance, energy efficiency, and area efficiency. We also provide full-fledged software stack support, including a parallel runtime and a compilation toolchain, to enable the development of end-to-end applications. We perform an experimental assessment of our design on a set of benchmarks representative of the near-sensor processing domain, complementing the timing results with a post-place-and-route analysis of the power consumption. Finally, a comparison with the state of the art shows that our solution outperforms the competitors in energy efficiency, reaching a peak of 97 Gflop/s/W on single-precision scalars and 162 Gflop/s/W on half-precision vectors.

“…However, in this state-of-the-art approach [24], the multipliers and the adder tree are still two separate computation components. On the other hand, some previous multiply-accumulate (MAC) designs [25, 26, 27, 28] have tried to reduce the overheads caused by final additions of multiplications. However, since these MAC designs [25, 26, 27, 28] assume that only one multiplier is used, their approaches cannot be directly applied to the design of 2-D convolver hardware circuit.…”

Section: Introduction (mentioning)

confidence: 99%

Convolver Design and Convolve-Accumulate Unit Design for Low-Power Edge Computing

Kao

Chen

Huang

2021

Sensors

Convolution operations have a significant influence on the overall performance of a convolutional neural network, especially in edge-computing hardware design. In this paper, we propose a low-power signed convolver hardware architecture that is well suited for low-power edge computing. The basic idea of the proposed convolver design is to combine all multipliers’ final additions and their corresponding adder tree to form a partial product matrix (PPM) and then to use the reduction tree algorithm to reduce this PPM. As a result, compared with the state-of-the-art approach, our convolver design not only saves many carry-propagation adders but also saves one clock cycle per convolution operation. Moreover, the proposed convolver design can be adapted for different dataflows (including input stationary dataflow, weight stationary dataflow, and output stationary dataflow). Depending on the dataflow, two types of convolve-accumulate units are proposed to perform the accumulation of convolution results. The results show that, compared with the state-of-the-art approach, the proposed convolver design reduces power consumption by 15.6%. Furthermore, compared with the state-of-the-art approach, the proposed convolve-accumulate units reduce power consumption by 15.7% on average.
