Floating-point units are seldom included in highly constrained systems, due to their silicon and energy footprint; floating-point arithmetic is instead emulated by algorithms based on integer arithmetic. In this paper, we use runtime code generation to produce flexible and optimized floating-point routines that outperform static emulation. On a Texas Instruments MSP430 fitted with only 512 bytes of RAM, we achieved mean speedups of 1032 % and 52 %, with tuning features enabling peaks up to 2012 % and 64 %, for floating-point multiplication and an applicative case respectively. To the best of our knowledge, runtime code generation had never been achieved with so few computing and memory resources.
I. INTRODUCTION

Embedded systems typically exhibit features far from those of general-purpose computing systems. The instruction set is reduced to the essentials, and memories (both flash and RAM) can be several orders of magnitude smaller than those of general-purpose computing systems. In order to satisfy cost, silicon surface and energy requirements, they use simpler hardware architectures, placing more stress on software algorithms. Because of their silicon and energy footprint, floating-point units (FPU) are seldom or only reluctantly included in the branches of embedded systems wherein energy efficiency prevails over computation speed. Sensor networks are a prime example of such a branch: they are composed of minimalist nodes disconnected from any power outlet, dedicated to specific and relatively simple tasks that cannot afford heavy architectures, and they favor cost and energy autonomy over raw computation power.

There are however cases where one cannot afford dedicated hardware, and where the acceleration of floating-point processing is still desirable. This is the main motivation for the work presented in this paper.

If the target processor lacks an FPU, the static compiler selects software emulation for floating-point processing. These emulation routines have a strong impact on performance, because the processing of mantissa and exponent is performed with integer arithmetic, and a rounding operation is then necessary to comply with the IEEE-754 floating-point representation. CPU architectures of less than 32 bits, which still make up the major part of sensor network nodes, render these routines even heavier: they cannot easily handle the 32-bit word length of the single-precision floating-point format, further increasing register and memory pressure.

Static compilers are blind concerning the values to be computed, preventing any optimization of runtime values even
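To give a concrete sense of the emulation cost discussed above, the following sketch shows how a software routine multiplies two single-precision floats using only integer arithmetic: unpacking sign, exponent and mantissa, multiplying the significands in a wide integer, normalizing, and repacking. This is an illustration written for this text, not the routines evaluated in the paper; the name `soft_fmul` is hypothetical, and the sketch handles only finite, normal, nonzero inputs and truncates instead of performing IEEE-754 round-to-nearest-even.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of software single-precision multiplication
   with integer arithmetic only. Simplifications: no subnormals,
   no NaN/infinity handling, no overflow check, truncation instead
   of round-to-nearest-even. */
static float soft_fmul(float fa, float fb)
{
    uint32_t a, b;
    memcpy(&a, &fa, 4);                 /* reinterpret bits as integers */
    memcpy(&b, &fb, 4);

    uint32_t sign = (a ^ b) & 0x80000000u;          /* result sign     */
    int32_t  exp  = (int32_t)((a >> 23) & 0xFF)     /* biased exponent */
                  + (int32_t)((b >> 23) & 0xFF) - 127;

    /* Restore the implicit leading 1 of each 24-bit significand. */
    uint64_t ma = (uint64_t)(a & 0x007FFFFFu) | 0x00800000u;
    uint64_t mb = (uint64_t)(b & 0x007FFFFFu) | 0x00800000u;

    uint64_t prod = ma * mb;            /* up to 48 significant bits */
    if (prod & (1ull << 47)) {          /* product in [2,4): renormalize */
        prod >>= 1;
        exp += 1;
    }

    /* Drop the 23 extra fraction bits (truncation) and the implicit 1. */
    uint32_t frac = (uint32_t)(prod >> 23) & 0x007FFFFFu;

    uint32_t r = sign | ((uint32_t)exp << 23) | frac;
    float fr;
    memcpy(&fr, &r, 4);
    return fr;
}
```

Even this simplified version requires a 48-bit intermediate product and several shift and mask steps, which a sub-32-bit CPU must itself decompose into multi-word integer operations, illustrating why emulated floating point weighs so heavily on such targets.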