Stuart F. Oberman scite author profile

A table-based method for high-speed function approximation in single-precision floating-point format is presented in this paper. Our focus is the approximation of reciprocal, square root, square root reciprocal, exponentials, logarithms, trigonometric functions, powering (with a fixed exponent p), or special functions. The algorithm presented here combines table look-up, an enhanced minimax quadratic approximation, and an efficient evaluation of the second-degree polynomial (using a specialized squaring unit, redundant arithmetic, and multioperand addition). The execution times and area costs of an architecture implementing our method are estimated, showing the achievement of the fast execution times of linear approximation methods and the reduced area requirements of other second-degree interpolation algorithms. Moreover, the use of an enhanced minimax approximation which, through an iterative process, takes into account the effect of rounding the polynomial coefficients to a finite size allows for a further reduction in the size of the look-up tables to be used, making our method very suitable for the implementation of an elementary function generator in state-of-the-art DSPs or graphics processing units (GPUs).
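The table-plus-quadratic idea in this abstract can be illustrated with a small software sketch. It is not the paper's method: the enhanced minimax fit and the coefficient-rounding iteration are replaced here by plain interpolation through three Chebyshev nodes per segment, and the names `M_BITS`, `build_table`, and `approx` are hypothetical. The upper bits of the input select a table entry holding three coefficients, and the lower bits feed the second-degree polynomial.

```python
import math

M_BITS = 6  # hypothetical index width: 2**6 = 64 table segments over [1, 2)


def build_table(f, m_bits=M_BITS):
    """Return one (c0, c1, c2) triple per segment of [1, 2).

    Assumption: coefficients come from interpolation at three Chebyshev
    nodes per segment, a simple stand-in for a true minimax fit.
    """
    n = 1 << m_bits
    h = 1.0 / n
    # Chebyshev nodes mapped onto the offset range [0, h]
    ts = [h / 2 * (1 + math.cos((2 * k + 1) * math.pi / 6)) for k in range(3)]
    table = []
    for i in range(n):
        x0 = 1.0 + i * h
        ys = [f(x0 + t) for t in ts]
        # Newton divided differences, expanded to powers of t
        d01 = (ys[1] - ys[0]) / (ts[1] - ts[0])
        d12 = (ys[2] - ys[1]) / (ts[2] - ts[1])
        d012 = (d12 - d01) / (ts[2] - ts[0])
        c2 = d012
        c1 = d01 - d012 * (ts[0] + ts[1])
        c0 = ys[0] - d01 * ts[0] + d012 * ts[0] * ts[1]
        table.append((c0, c1, c2))
    return table


def approx(table, x, m_bits=M_BITS):
    """Evaluate the piecewise quadratic for x in [1, 2)."""
    n = 1 << m_bits
    i = int((x - 1.0) * n)      # upper bits of the fraction pick the segment
    t = (x - 1.0) - i / n       # lower bits are the offset inside the segment
    c0, c1, c2 = table[i]
    return c0 + c1 * t + c2 * t * t


recip_table = build_table(lambda x: 1.0 / x)
print(approx(recip_table, 1.37), 1.0 / 1.37)
```

With 64 segments the interpolation error for the reciprocal stays near single-precision resolution, which suggests why a quadratic scheme can use much smaller tables than a linear one. A hardware design would additionally store the coefficients at reduced word lengths, which this floating-point sketch does not model.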


Floating point division and square root algorithms and implementation in the AMD-K7™ microprocessor

Oberman

111


Design issues in division and other floating-point operations

Oberman

Flynn

1997

IEEE Trans. Comput.

143


Floating-point division is generally regarded as a low frequency, high latency operation in typical floating-point applications. However, in the worst case, a high latency hardware floating-point divider can contribute an additional 0.50 CPI to a system executing SPECfp92 applications. This paper presents the system performance impact of floating-point division latency for varying instruction issue rates. It also examines the performance implications of shared multiplication hardware, shared square root, on-the-fly rounding and conversion, and fused functional units. Using a system level study as a basis, it is shown how typical floating-point applications can guide the designer in making implementation decisions and trade-offs.


SRT division architectures and implementations

Harris

Oberman

Horowitz


SRT dividers are common in modern floating point units. Higher division performance is achieved by retiring more quotient bits in each cycle. Previous research has shown that realistic stages are limited to radix-2 and radix-4. Higher radix dividers are therefore formed by a combination of low-radix stages. In this paper, we present an analysis of the effects of radix-2 and radix-4 SRT divider architectures and circuit families on divider area and performance. We show the performance and area results for a wide variety of divider architectures and implementations. We conclude that divider performance is only weakly sensitive to reasonable choices of architecture but significantly improved by aggressive circuit techniques.
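As a rough illustration of what a single radix-2 SRT stage computes, the sketch below runs the textbook recurrence w[j+1] = 2*w[j] - q[j+1]*d with quotient digits drawn from {-1, 0, 1}. It is a behavioral model under stated assumptions, not any of the architectures analyzed in the paper: exact `Fraction` arithmetic stands in for the carry-save remainder and truncated selection logic of real hardware, and the function name `srt_radix2` is hypothetical.

```python
from fractions import Fraction


def srt_radix2(x, d, n_bits):
    """Radix-2 SRT division of x by d, both Fractions in [1/2, 1).

    Returns (quotient, final_remainder). One quotient digit from
    {-1, 0, 1} is retired per iteration.
    """
    half = Fraction(1, 2)
    assert half <= x < 1 and half <= d < 1
    w = x / 2              # pre-scale so the invariant |w| <= d holds at the start
    q = Fraction(0)
    for j in range(1, n_bits + 1):
        w2 = 2 * w
        # Digit selection compares the shifted remainder with +/- 1/2 only,
        # which is what lets hardware inspect just a few leading bits.
        if w2 >= half:
            digit = 1
        elif w2 < -half:
            digit = -1
        else:
            digit = 0
        w = w2 - digit * d
        q += Fraction(digit, 2 ** j)
    return 2 * q, w        # undo the initial pre-scaling of x


quotient, remainder = srt_radix2(Fraction(3, 4), Fraction(5, 8), 24)
print(float(quotient))     # close to 1.2
```

Because each iteration retires one digit, an n-bit quotient needs n iterations here. A radix-4 stage would retire two bits per iteration at the cost of a more complex selection function, which is the trade-off the abstract refers to.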


AMD 3DNow! technology: architecture and implementations

Oberman

Favor

Weber

1999

IEEE Micro


The SNAP project: design of floating point arithmetic units

Oberman

Altwaijry

Flynn


In recent years computer applications have increased in their computational complexity. The industry-wide usage of performance benchmarks, such as SPECmarks, and the popularity of 3D graphics applications forces processor designers to pay particular attention to implementation of the floating point unit, or FPU. This paper presents results of the Stanford subnanosecond arithmetic processor (SNAP) research effort in the design of hardware for floating point addition, multiplication and division. We show that one cycle FP addition is achievable 32% of the time using a variable latency algorithm. For multiplication, a binary tree is often inferior to a Wallace-tree designed using an algorithmic layout approach for contemporary feature sizes (0.3 µm). Further, in most cases two-bit Booth encoding of the multiplier is preferable to non-Booth encoding for partial product generation. It appears that for division, optimum area-performance is achieved using functional iteration, and we present two techniques to further reduce average division latency.
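Division by functional iteration, mentioned at the end of this abstract, can be sketched with the classic Newton-Raphson reciprocal recurrence r <- r * (2 - m * r). This is a generic textbook illustration, not the SNAP design or its latency-reduction techniques. The function name `nr_divide`, the linear starting estimate 48/17 - 32/17 * m, and the iteration count of 4 are assumptions chosen for the example.

```python
import math


def nr_divide(a, b, iterations=4):
    """Approximate a / b using Newton-Raphson iteration on 1 / b.

    Only multiplications and subtractions appear inside the loop, which
    is why this style of division can reuse an existing multiplier.
    """
    if b == 0:
        raise ZeroDivisionError("division by zero")
    sign = -1.0 if b < 0 else 1.0
    m, e = math.frexp(abs(b))            # abs(b) == m * 2**e with m in [0.5, 1)
    r = 48.0 / 17.0 - 32.0 / 17.0 * m    # linear seed, relative error at most 1/17
    for _ in range(iterations):
        r = r * (2.0 - m * r)            # the relative error squares each pass
    return sign * math.ldexp(a * r, -e)  # a / b == a * (1 / m) * 2**(-e)


print(nr_divide(1.0, 3.0))
print(nr_divide(7.0, -2.0))
```

Since the error roughly squares on every pass, a more accurate seed (for example from a small lookup table) reduces the number of iterations needed, which is one common way such dividers trade area against latency.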



Stuart F. Oberman

NVIDIA Tesla: A Unified Graphics and Computing Architecture

IEEE Standard for Floating-Point Arithmetic

High-speed function approximation using a minimax quadratic interpolator

Floating point division and square root algorithms and implementation in the AMD-K7™ microprocessor

Design issues in division and other floating-point operations

SRT division architectures and implementations

AMD 3DNow! technology: architecture and implementations

The SNAP project: design of floating point arithmetic units
