Large multipliers with fewer DSP blocks

Dinechin, Florent de; Pasca, Bogdan

doi:10.1109/fpl.2009.5272296

Cited by 65 publications

(83 citation statements)

References 2 publications


“…In our previous work [28], an efficient squarer was proposed which consumes up to 50% less hardware resources than an equivalent width multiplier. It can use 1 fewer DSP block than the method in [29] at a cost of only 127 additional LUTs. The architecture for a 52-bit squarer is shown in Figure 4.…”

Section: FPGA Computation (mentioning)

confidence: 99%
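The resource saving of a squarer over a general multiplier quoted above can be illustrated with a generic symmetry argument. The sketch below is not the architecture of [28] or [29]; it is a minimal Python model, under the assumption that the operand is split into `k`-bit digits, showing that the cross products `X_i * X_j` and `X_j * X_i` coincide, so each is computed once and doubled. The function name and the digit width are hypothetical, chosen only for illustration.

```python
def square_by_digits(x, k):
    """Square a non-negative integer using k-bit digits.

    Returns (x * x, number_of_digit_products). Symmetric cross products
    are computed once and doubled with an extra left shift by one bit.
    """
    mask = (1 << k) - 1
    digits = []
    while x:
        digits.append(x & mask)
        x >>= k

    total = 0
    products = 0
    for i, xi in enumerate(digits):
        # Diagonal term X_i^2 at weight 2^(2*i*k).
        total += (xi * xi) << (2 * i * k)
        products += 1
        for j in range(i + 1, len(digits)):
            # Off-diagonal term counted twice: 2 * X_i * X_j at weight 2^((i+j)*k).
            total += (xi * digits[j]) << ((i + j) * k + 1)
            products += 1
    return total, products


value = 0xABCDEF                    # 24-bit operand, three 8-bit digits
result, used = square_by_digits(value, 8)
print(result == value * value, used)
```

With three digits, a general multiplier needs nine digit products while this squarer needs six; in general the count drops from n² to n(n+1)/2, which approaches half for large n.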


Square-rich fixed point polynomial evaluation on FPGAs

Xu, Fahmy, McLoughlin

2014

Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays


Polynomial evaluation is important across a wide range of application domains, so significant work has been done on accelerating its computation. The conventional algorithm, referred to as Horner's rule, involves the least number of steps but can lead to increased latency due to serial computation. Parallel evaluation algorithms such as Estrin's method have shorter latency than Horner's rule, but achieve this at the expense of large hardware overhead. This paper presents an efficient polynomial evaluation algorithm, which reforms the evaluation process to include an increased number of squaring steps. By using a squarer design that is more efficient than general multiplication, this can result in polynomial evaluation with a 57.9% latency reduction over Horner's rule and 14.6% over Estrin's method, while consuming less area than Horner's rule, when implemented on a Xilinx Virtex 6 FPGA. When applied in fixed point function evaluation, where precision requirements limit the rounding of operands, it still achieves a 52.4% performance gain compared to Horner's rule with only a 4% area overhead in evaluating 5th-degree polynomials.
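The contrast between the two baseline algorithms named in this abstract can be sketched in a few lines of Python. This is a software illustration only, with hypothetical function names; it does not reproduce the paper's square-rich algorithm or any hardware detail. Horner's rule forms one serial chain of multiply-adds, while Estrin's method pairs neighbouring terms so that the operations within each level are independent, at the cost of extra squarings of `x`.

```python
def horner(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) as a serial chain of multiply-adds."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def estrin(coeffs, x):
    """Evaluate the same polynomial by pairing terms level by level.

    Within one level every pair can be computed independently, which is
    what allows a parallel hardware implementation to shorten latency.
    """
    terms = list(coeffs)
    if not terms:
        return 0
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0)          # pad so every term has a partner
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power = power * power        # x, x^2, x^4, ... obtained by squaring
    return terms[0]


coefficients = [1, 2, 3, 4, 5, 6]    # 1 + 2x + 3x^2 + 4x^3 + 5x^4 + 6x^5
print(horner(coefficients, 2), estrin(coefficients, 2))
```

For a degree-5 polynomial, Horner's rule needs five dependent multiply-add steps, whereas the Estrin form above finishes in three levels.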



“…Note that in order to achieve maximum DSP block frequency, it is necessary to add an additional register stage any time a DSP block output is passed to LUTs (for implemented small adders for example). This has not been taken into account in [29] and [13], but is done by default in this work.…”

Section: FPGA Computation (mentioning)

confidence: 99%


“…If the input size of an operand is not an exact multiple of the inputs (m, n) of the embedded multiplier, the last digit obtained by the decomposition is zero-padded to match the nearest multiple. Exploiting the leading zeros and approaches similar to the non-standard tiling [18] is a task considered for future.…”

Section: Operand Decomposition (mentioning)

confidence: 99%
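The zero-padding rule described in this statement can be made concrete with a small calculation. The sketch below assumes an embedded multiplier with unsigned input widths `m` and `n`; the values 24 and 17 are an illustrative assumption, not taken from the quoted text, and the function name is hypothetical. It reports the padded operand widths and the number of embedded multipliers a plain rectangular decomposition would use.

```python
def padded_tiling(wx, wy, m, n):
    """Return (padded_wx, padded_wy, multiplier_count) for a wx-by-wy
    product decomposed into m-by-n embedded multipliers, with each
    operand zero-padded up to the nearest multiple of m or n."""
    digits_x = -(-wx // m)           # ceiling division
    digits_y = -(-wy // n)
    return digits_x * m, digits_y * n, digits_x * digits_y


# Illustrative case: a 53 x 53 product on assumed 24 x 17 multipliers.
padded_x, padded_y, count = padded_tiling(53, 53, 24, 17)
print(padded_x, padded_y, count)
```

In this case the operands are padded from 53 bits to 72 and 68 bits, so part of the multiplier array only processes leading zeros. That wasted area is what approaches such as non-standard tiling try to recover.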

“…The authors in [18] explore three alternative types of large integer multiplier generation for FPGAs: Karatsuba-Ofman algorithm, non-standard tiling (an alternate, less regular form of divide and conquer) and specialized squarers. The Karatsuba-Ofman algorithm trades multiplications for additions by rearranging the creation of partial products and thereby reducing the number of multipliers/DSP blocks required.…”

Section: Related Work (mentioning)

confidence: 99%
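The trade of multiplications for additions mentioned in this statement can be shown with one level of the Karatsuba-Ofman recurrence. The following is a minimal Python sketch for non-negative integers, with a hypothetical function name and split width `k`; it illustrates the textbook identity only and makes no claim about how the cited generators map the sub-products onto DSP blocks.

```python
def karatsuba_one_level(x, y, k):
    """Multiply non-negative integers x and y using three sub-products
    instead of four, splitting each operand at bit position k."""
    mask = (1 << k) - 1
    x0, x1 = x & mask, x >> k        # x = x1 * 2^k + x0
    y0, y1 = y & mask, y >> k        # y = y1 * 2^k + y0

    low = x0 * y0                    # sub-product 1
    high = x1 * y1                   # sub-product 2
    # Sub-product 3 yields the middle term x0*y1 + x1*y0 after two
    # subtractions, replacing the two multiplications of the schoolbook form.
    mid = (x0 + x1) * (y0 + y1) - low - high

    return (high << (2 * k)) + (mid << k) + low


a, b = 0x1FFFFFFFF, 0x123456789      # two 33-bit operands
print(karatsuba_one_level(a, b, 17) == a * b)
```

The saving of one multiplication is paid for with two extra pre-additions and two subtractions, and the third sub-product is one bit wider per operand than the others.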

Automatic generation of high-performance multipliers for FPGAs with asymmetric multiplier blocks

Srinath, Compton

2010

Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays


The introduction of asymmetric embedded multiplier blocks in recent Xilinx FPGAs complicates the design of larger multiplier sizes. The two different input bitwidths of the embedded multipliers lead to two different shifting factors for the partial products that must be summed. This makes even the most straightforward multiplier design less intuitive. In this thesis, I present a methodology and set of equations to automatically generate Verilog hardware description code for arbitrary multiplier sizes composed of arbitrarily-sized asymmetric embedded multiplier cores. The presented technique also uses intelligent rearrangement of the multiplier block outputs into partial product terms to reduce the overall delay of the circuit. Multipliers created with this generator are faster and use fewer DSP blocks than either those created using Xilinx Core Generator or those created by simply using the '*' operator in Verilog. They also use fewer LUTs than those created using the '*' operator. Finally, the presented generator can create multipliers larger than possible with Core Generator, and is limited only by the number of available embedded multipliers.
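The two shifting factors mentioned in this abstract can be seen in a small software model of a multiplier built from asymmetric blocks. The sketch below is an assumption-laden illustration, not the thesis's generator: the function names are hypothetical, the default block widths of 24 and 17 bits are chosen only as an example, and operands are treated as unsigned. One operand is cut into `m`-bit digits and the other into `n`-bit digits, so the partial product of digit `i` and digit `j` carries the weight `2^(i*m + j*n)`.

```python
def split_digits(value, width, digit_bits):
    """Cut a width-bit unsigned value into digit_bits-wide digits,
    least significant first; the top digit is implicitly zero-padded."""
    count = -(-width // digit_bits)  # ceiling division
    mask = (1 << digit_bits) - 1
    return [(value >> (i * digit_bits)) & mask for i in range(count)]


def tiled_multiply(x, y, wx, wy, m=24, n=17):
    """Multiply x (wx bits) by y (wy bits) from m-by-n partial products."""
    xs = split_digits(x, wx, m)
    ys = split_digits(y, wy, n)
    total = 0
    for i, xi in enumerate(xs):
        for j, yj in enumerate(ys):
            # Two different shift factors: i*m for x digits, j*n for y digits.
            total += (xi * yj) << (i * m + j * n)
    return total


x = (1 << 53) - 3
y = (1 << 53) - 12345
print(tiled_multiply(x, y, 53, 53) == x * y)
```

Because `m` and `n` differ, partial products no longer line up on a single regular grid of bit positions, which is why the order in which they are summed affects the adder structure and delay.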


“…However, the performance of the multiplier is suitable only when operand sizes are small. In this regard, Florent de Dinechin and Bogdan Pasca [3] have also presented their work in which, they used fewer DSP blocks to realize large multipliers. They demonstrated better performance, in terms of saving precious DSP blocks and maintaining the operating frequency.…”

Section: Introduction (mentioning)

confidence: 99%

FPGA design, simulation and prototyping of a high speed 32-bit pipeline multiplier based on Vedic mathematics

Abbasi, Zulhelmi, Alamoud

2015

IEICE Electron. Express


This research presents a new approach to optimizing multiplier designs based on the concept of Vedic mathematics. The design targets state-of-the-art field-programmable gate arrays (FPGAs). The multiplier produces partial products using the Vedic mathematics concept with basic 4 × 4 multipliers, which are designed by exploiting special features of multiplexers and 6-input look-up tables (LUTs) on the same slices, resulting in a considerable reduction in area. The multiplier has been realized on Xilinx® Virtex-5 FPGAs. Pipelined adders are used to obtain the final products. Furthermore, the multiplier is organized using pipeline schemes, which raise its operating frequency. The results show that the 32-bit pipelined multiplier can work up to a clock frequency of 450 MHz. It uses 514 slices and 1157 flip-flops and consumes much less dynamic power than other reported work.

