This paper presents a novel variable-latency multiplier architecture, suitable for implementation either as a self-timed multiplier core or as a fully synchronous multi-cycle multiplier core. The architecture combines a 2nd-order Booth algorithm with a split carry-save array pipelined organization, incorporating multiple-row skipping and a completion-predicting carry-select final adder. The paper reports the architecture and logic design, the CMOS circuit design, and a performance evaluation. In a 0.35 µm CMOS process, the expected sustainable cycle time for a 32-bit synchronous implementation is 2.25 ns. Instruction-level simulations estimate 54% single-cycle and 46% two-cycle operations in SPEC95 execution. In the same CMOS process, the 32-bit asynchronous implementation is expected to reach an average throughput of 1.76 ns and a latency of 3.48 ns in SPEC95 execution.

I. INTRODUCTION

Fast integer multipliers are a key topic in the VLSI design of high-speed microprocessors. Recent results have shown that, through careful full-custom CMOS design, a 54x54-bit multiplication in less than 3 ns is possible [21]. However, with commonly available CMOS processes, microarchitectures with 2 ns cycle times are commercially available [28]. As a result, owing to the registers' setup and hold times, even a fast 32-bit multiplication may not fit in a single cycle, and pipelined multi-cycle multipliers are a common design choice to avoid the whole microarchitecture being limited by a relatively slow multiplier. Data dependencies always limit the throughput of pipelined arithmetic units [22], because of the idle cycles between consecutive dependent operations. To overcome this, synchronous variable-latency pipelined addition units have recently been proposed in industrial DSP design [30]. A variable-latency unit operates as a normal pipelined unit, but for most operands it can complete its operation in a single cycle, thus avoiding the insertion of idle cycles and improving the average throughput.
A synchronous signal flags the cycle in which the operation completes. A more aggressive implementation of this idea is inherent in asynchronous design, with self-timed units capable of an average response faster than the worst case [6][9][14][25][29][39][52].
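To fix ideas, the 2nd-order (radix-4) Booth recoding named above can be illustrated behaviorally as follows. This is a software sketch under our own naming conventions, not the paper's circuit-level design: the multiplier is recoded into digits in {-2, -1, 0, 1, 2} from overlapping 3-bit groups, halving the number of partial products relative to a simple shift-and-add scheme.

```python
# Behavioral sketch of radix-4 (2nd-order) Booth recoding.
# Function names and bit widths are illustrative, not from the paper.

def booth_radix4_digits(y, bits=32):
    """Return the Booth digits (least significant first) of a
    two's-complement multiplier y represented in `bits` bits."""
    y &= (1 << bits) - 1              # two's-complement encoding
    digits = []
    prev = 0                          # implicit bit y[-1] = 0
    for i in range(0, bits, 2):
        # group = bits y[i+1], y[i], y[i-1]
        group = (((y >> i) & 0b11) << 1) | prev
        # digit value = -2*y[i+1] + y[i] + y[i-1]
        digit = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                 0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}[group]
        digits.append(digit)
        prev = (y >> (i + 1)) & 1     # carry y[i+1] into the next group
    return digits

def booth_multiply(x, y, bits=32):
    """Multiply by summing the shifted partial products d_i * x * 4**i."""
    return sum(d * x * (4 ** i)
               for i, d in enumerate(booth_radix4_digits(y, bits)))
```

In the actual array, each digit selects 0, ±x, or ±2x as a partial product; runs of zero digits are what the multiple-row-skipping organization exploits to shorten the average latency.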