Floating-point units are seldom included in highly constrained systems, due to their silicon and energy footprint; floating-point arithmetic is instead emulated by algorithms based on integer arithmetic. In this paper, we use runtime code generation to produce flexible and optimized floating-point routines that outperform static emulation. On a Texas Instruments MSP430 fitted with only 512 bytes of RAM, we achieved mean speedups of 1032 % and 52 %, with tuning features enabling peaks up to 2012 % and 64 %, for floating-point multiplication and an applicative case respectively. To the best of our knowledge, runtime code generation had never been achieved with so few computing and memory resources.
I. INTRODUCTION

Embedded systems typically exhibit features far from those of general-purpose computing systems. The instruction set is reduced to the essentials, and memories (both flash and RAM) can be several orders of magnitude smaller than those of general-purpose computing systems. In order to satisfy cost, silicon surface and energy requirements, they use simpler hardware architectures, placing more stress on software algorithms. Because of their silicon and energy footprint, floating-point units (FPU) are seldom or only reluctantly included in the branches of embedded systems wherein energy efficiency prevails over computation speed. Sensor networks are a prime example of such a branch: they are composed of minimalist nodes disconnected from any power outlet, dedicated to specific and relatively simple tasks that cannot afford heavy architectures, and they favor cost and energy autonomy over raw computation power.

There are however cases where one cannot afford dedicated hardware, and where the acceleration of floating-point processing is still desirable. This is the main motivation for the work presented in this paper.

If the target processor lacks an FPU, the static compiler selects software emulation for floating-point processing. These emulation routines have a strong impact on performance, because the processing of mantissa and exponent is performed with integer arithmetic, and a rounding operation is then necessary to comply with the IEEE-754 floating-point representation. CPU architectures of less than 32 bits, which still make up the major part of sensor network nodes, render these routines even heavier: they cannot easily handle the 32-bit word length of the single-precision floating-point format, further increasing register and memory pressure.

Static compilers are blind concerning the values to be computed, preventing any optimization of runtime values even
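To give a concrete sense of the emulation cost discussed above, the following sketch shows how a software routine multiplies two single-precision floats using only integer arithmetic: unpacking sign, exponent and mantissa, multiplying the significands in a wide integer, normalizing, and repacking. This is an illustration written for this text, not the routines evaluated in the paper; the name `soft_fmul` is hypothetical, and the sketch handles only finite, normal, nonzero inputs and truncates instead of performing IEEE-754 round-to-nearest-even.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of software single-precision multiplication
   with integer arithmetic only. Simplifications: no subnormals,
   no NaN/infinity handling, no overflow check, truncation instead
   of round-to-nearest-even. */
static float soft_fmul(float fa, float fb)
{
    uint32_t a, b;
    memcpy(&a, &fa, 4);                 /* reinterpret bits as integers */
    memcpy(&b, &fb, 4);

    uint32_t sign = (a ^ b) & 0x80000000u;          /* result sign     */
    int32_t  exp  = (int32_t)((a >> 23) & 0xFF)     /* biased exponent */
                  + (int32_t)((b >> 23) & 0xFF) - 127;

    /* Restore the implicit leading 1 of each 24-bit significand. */
    uint64_t ma = (uint64_t)(a & 0x007FFFFFu) | 0x00800000u;
    uint64_t mb = (uint64_t)(b & 0x007FFFFFu) | 0x00800000u;

    uint64_t prod = ma * mb;            /* up to 48 significant bits */
    if (prod & (1ull << 47)) {          /* product in [2,4): renormalize */
        prod >>= 1;
        exp += 1;
    }

    /* Drop the 23 extra fraction bits (truncation) and the implicit 1. */
    uint32_t frac = (uint32_t)(prod >> 23) & 0x007FFFFFu;

    uint32_t r = sign | ((uint32_t)exp << 23) | frac;
    float fr;
    memcpy(&fr, &r, 4);
    return fr;
}
```

Even this simplified version requires a 48-bit intermediate product and several shift and mask steps, which a sub-32-bit CPU must itself decompose into multi-word integer operations, illustrating why emulated floating point weighs so heavily on such targets.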