An Operand-Optimized Asynchronous IEEE 754 Double-Precision Floating-Point Adder

Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation

et al. 2015

Self Cite

Self-timed chip designs are commonly specified in a high-level message-passing language called CHP [21]. This language is closely related to Hoare's CSP [11] except it admits erroneous behavior due to the necessary limitations of efficient hardware implementations. For example, two processes sending on the same channel at the same time causes glitches and short circuits in the physical chip implementation. If a CHP program maintains certain invariants, such as only one process is sending on any given channel at a time, it can guarantee an error-free execution that behaves much like a CSP program would. In this paper, we present an inferable effect system for ensuring that these invariants hold, drawing from model-checking methodologies while exploiting language-usage patterns and domain-specific specializations to achieve efficiency. This analysis is sound, and is even complete for the common subset of CHP programs without data-sensitive synchronization. We have implemented the analysis and demonstrated that it scales to validate even microprocessors.
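The invariant named in the abstract above (at most one process sending on a given channel at a time) can be illustrated with a deliberately simplified check. The sketch below is not the paper's effect system: it assumes every listed process runs concurrently for the whole execution and that each process is summarized only by the set of channels it may send on. The names `find_send_conflicts` and `processes` are hypothetical.

```python
from collections import defaultdict


def find_send_conflicts(processes):
    """Return channels that more than one concurrent process may send on.

    `processes` maps a process name to the set of channel names it may
    send on. This coarse model ignores ordering and synchronization, so
    it can report conflicts that a more precise analysis would rule out.
    """
    senders = defaultdict(set)
    for name, channels in processes.items():
        for channel in channels:
            senders[channel].add(name)
    # A channel is suspicious when two or more processes may send on it.
    return {ch: sorted(names) for ch, names in senders.items() if len(names) > 1}


# Hypothetical example: P and R may both send on channel "b".
processes = {"P": {"a", "b"}, "Q": {"c"}, "R": {"b"}}
conflicts = find_send_conflicts(processes)
```

Because this sketch over-approximates, an empty result suggests the single-sender invariant holds under these assumptions, while a non-empty result only flags channels that deserve a closer look.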

Section: Self-timed VLSI (mentioning)

confidence: 99%

Preventing glitches and short circuits in high-level self-timed chip specifications

Longfield

Nkounkou

Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation

et al. 2015

Self Cite

“…Unlike in the FPA datapath where total power is distributed roughly evenly amongst a number of different logic blocks [24], the FPM's complexity is largely a function of its 53x53 multiplier. This is highlighted in Figure 2 which shows the power breakdown estimates of our baseline fully QDI FPM datapath.…”

Section: Floating-point Multiplier Power Breakdown (mentioning)

confidence: 99%

“…A floating-point multiplier consumes significantly more energy compared to a floating-point adder (FPA) [21,24]. This combined with the knowledge that the frequency of floating-point multiplication operations in emerging applications is similar to that of floating-point addition computations makes energy and power optimizations in the FPM datapath highly essential for an efficient full floating-point unit (FPU) design.…”

Section: Introduction (mentioning)

confidence: 99%

“…Their FPU has many orders of magnitude higher latency compared to all recent floating-point designs from synchronous domain. Sheikh et al. [24] employed fine-grain asynchronous circuit techniques for various operand-dependent optimization techniques to reduce average-case power consumption in the FPA datapath. However, their work is restricted to FPA design only.…”

Section: A. Asynchronous Multipliers and Floating-point Arithmetic (mentioning)

confidence: 99%


An Asynchronous Floating-Point Multiplier

Sheikh

2012 IEEE 18th International Symposium on Asynchronous Circuits and Systems

2012

Self Cite

Abstract: We present the details of our energy-efficient asynchronous floating-point multiplier (FPM). We discuss design trade-offs of various multiplier implementations. A higher radix array multiplier design with operand-dependent carry-propagation adder and low handshake overhead pipeline design is presented, which yields significant energy savings while preserving the average throughput. Our FPM also includes a hardware implementation of denormal and underflow cases. When compared against a custom synchronous FPM design, our asynchronous FPM consumes 3X less energy per operation while operating at 2.3X higher throughput. To our knowledge, this is the first detailed design of a high-performance asynchronous IEEE-754 compliant double-precision floating-point multiplier.

Keywords: Floating point arithmetic; asynchronous logic circuits; very-large-scale integration; pipeline processing

I. INTRODUCTION

Energy-efficient floating-point computation is important for a wide range of applications. Traditionally, VLSI designers primarily relied on CMOS technology and voltage scaling to reduce power consumption [4]. With the transistor threshold voltage fixed [10], V_DD has been scaling very slowly if at all, which means all performance improvements come at an increased energy consumption. Furthermore, process variations in deep sub-micron range have made devices far less robust, which is increasingly making it difficult for synchronous designers to overcome the problems associated with clock skew rates and clock distribution [6]. The findings of a recent in-depth study, to explore and devise ways to further scale supercomputer petaFLOP performance by 1000X, indicate the inadequacy of current design practices and technologies to achieve the desired throughput within a sustainable power budget [1]. This underscores a pressing need for alternate design practices, to reduce energy consumption for floating-point computations while preserving robust behavior in advanced technology nodes.

At the other end of the spectrum, embedded systems that have traditionally been considered low performance are demanding higher and higher throughput for the same power budget to support compute-intensive floating-point applications that improve the user experience. Since these applications have to be deployed on portable devices with limited battery life, it is critical that we develop energy-efficient floating-point hardware for these embedded systems, not simply high performance floating-point hardware.

The IEEE 754 standard [19] for binary floating-point arithmetic provides a precise specification of floating-point number formats, computation operations, and exceptions and their handling. The combination of a vast range of inputs, special cases, and rounding modes makes the hardware implementation of fully IEEE 754 standard compliant floating-point arithmetic a very challenging task. Ignoring certain aspects of the standard can lead to unexpected consequences in the context of numerical algorithms. Hence, most floating-point hardware is IEEE-comp...

“…The QDI circuits have been used in numerous high-performance, energy-efficient asynchronous designs [Sheikh and Manohar 2010] [D. Fang and Manohar 2005], including a fully implemented and fabricated asynchronous microprocessor [Martin et al. 1997].…”

Section: Introduction (mentioning)

confidence: 99%

Energy-Efficient Pipeline Templates for High-Performance Asynchronous Circuits

Sheikh

J. Emerg. Technol. Comput. Syst.

2011

Self Cite

We present two novel energy-efficient pipeline templates for high throughput asynchronous circuits. The proposed templates, called N-P and N-Inverter pipelines, use single-track handshake protocol. There are multiple stages of logic within each pipeline. The proposed techniques minimize handshake overheads associated with input tokens and intermediate logic nodes within a pipeline template. Each template can pack significant amount of logic in a single stage, while still maintaining a fast cycle time of only 18 transitions. Noise and timing robustness constraints of our pipelined circuits are quantified across all process corners. A completion detection scheme based on wide NOR gates is presented, which results in significant latency and energy savings especially as the number of outputs increase. To fully quantify all design trade-offs, three separate pipeline implementations of an 8x8-bit Booth-encoded array multiplier are presented. Compared to a standard QDI pipeline implementation, the N-Inverter and N-P pipeline implementations reduced the energy-delay product by 38.5% and 44% respectively. The overall multiplier latency was reduced by 20.2% and 18.7%, while the total transistor width was reduced by 35.6% and 46% with N-Inverter and N-P pipeline templates respectively.
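The energy-delay product (EDP) figures quoted above combine two quantities, so a short worked sketch may help when reading them. Assuming EDP is energy multiplied by delay, a fractional energy reduction e and a fractional delay reduction d compose to an EDP reduction of 1 - (1 - e)(1 - d). The abstract does not say which delay metric enters its EDP, so using the reported latency reductions below is an assumption made only for illustration, and the resulting energy figures are not claims from the paper. The function names are hypothetical.

```python
def edp_reduction(energy_reduction, delay_reduction):
    """Fractional EDP reduction given fractional energy and delay reductions."""
    return 1.0 - (1.0 - energy_reduction) * (1.0 - delay_reduction)


def implied_energy_reduction(edp_red, delay_red):
    """Invert edp_reduction: energy reduction implied by EDP and delay reductions."""
    if not 0.0 <= delay_red < 1.0:
        raise ValueError("delay_red must be in [0, 1)")
    return 1.0 - (1.0 - edp_red) / (1.0 - delay_red)


# (EDP reduction, latency reduction) as quoted in the abstract.
# ASSUMPTION: latency is the delay term in EDP; the abstract does not say so.
reported = {"N-Inverter": (0.385, 0.202), "N-P": (0.44, 0.187)}
implied = {
    name: implied_energy_reduction(edp, delay)
    for name, (edp, delay) in reported.items()
}
for name, value in implied.items():
    print(f"{name}: implied energy reduction of about {value:.1%}")
```

If the paper's EDP uses cycle time or another delay measure instead of latency, the implied energy numbers would differ, which is why they should be read as an arithmetic illustration only.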