A double-precision multiplier for floating-point and media-streaming instructions in the first-generation CELL processor [1], built in 90nm PD/SOI, is reported. Multiplication by recoding and successive partial-product (PP) compression completes in three 11FO4 cycles, including merging with the aligner. Figure 20.3.3 shows the micro-architecture of the design. At 1.3V and 68°C, the hardware runs at 4.76GHz (Fig. 20.3.1). The multiplier area is 0.19mm², including decoupling capacitors. Only regular-Vt devices are used, in consideration of variability, leakage, and scalability. Other noted high-speed design points in 90nm technology are the single-precision [2] and low-FO4 double-precision [3] multipliers.

The first cycle starts with radix-4 Booth logic whose inputs are two 53b operands. The Booth circuits reduce the number of PP rows to 27. To minimize area and latch count, two levels of 3:2 compression in transmission-gate (TG) style circuits are also performed in this cycle. Footless domino circuits are used for the complex Booth encoding and muxing functions. Figure 20.3.4 depicts a pruned schematic diagram of the Booth encoder, Booth multiplexer (MUX), and pulse-to-static converter latch.

Static cycles 2 and 3 start with low-latency pulse latches (12 unfolded and 8 folded PP rows, respectively) to maximize cycle-time utilization and minimize clock power. Cycle 2 contains third-level 4:2 compressors (CMPs) and fourth-level 3:2 CMPs. In the third cycle, the fifth-level 4:2 CMP outputs are merged with the outputs from the aligners in the final 3:2 CMPs. To ensure noise immunity, no unbuffered TGs are used. Delay is reduced through customized connections between two compression levels such that the number of inversions in any given path is minimized. Interconnect penalties are minimized by splitting the wiring between the second (row-folding wires) and third (buses over the aligner) cycles.
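The recoding and compression sequence described above can be sketched behaviorally. The following Python model is a minimal illustration, not the reported circuit: all function names are hypothetical, the 4:2 compressor is modeled as two cascaded 3:2 cells, arithmetic is confined to an assumed 106-bit product window, and the addend stands in for the aligner output. The level order (3:2, 3:2, 4:2, 3:2, 4:2, final 3:2) follows the text; the row counts after the second level are an inference from that order. Row folding, TG and domino circuit styles, and timing are not modeled.

```python
import random

WIDTH = 53                     # operand width stated in the text
PROD_BITS = 2 * WIDTH          # assumed 106-bit two's-complement product window
MASK = (1 << PROD_BITS) - 1


def booth_radix4_digits(y, width=WIDTH):
    """Recode an unsigned multiplier into radix-4 Booth digits in {-2,...,2}."""
    n_digits = width // 2 + 1  # 53 bits -> 27 digits, i.e. 27 PP rows
    digits, prev = [], 0       # prev is the implicit bit y[-1] = 0
    for i in range(n_digits):
        b0 = (y >> (2 * i)) & 1
        b1 = (y >> (2 * i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def partial_products(x, y):
    """One PP row per Booth digit, as a 106-bit two's-complement word."""
    return [((d * x) << (2 * i)) & MASK
            for i, d in enumerate(booth_radix4_digits(y))]


def csa32(a, b, c):
    """3:2 compressor: three rows in, sum row and carry row out."""
    s = a ^ b ^ c
    carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK
    return s, carry


def csa42(a, b, c, d):
    """4:2 compressor, modeled here as two cascaded 3:2 cells."""
    s1, c1 = csa32(a, b, c)
    return csa32(s1, c1, d)


def compress_level(rows, kind):
    """Apply one level of 3:2 or 4:2 compression; leftover rows pass through."""
    n, cell = (3, csa32) if kind == "3:2" else (4, csa42)
    full = len(rows) - len(rows) % n
    out = []
    for i in range(0, full, n):
        out.extend(cell(*rows[i:i + n]))
    out.extend(rows[full:])
    return out


LEVELS = ["3:2", "3:2", "4:2", "3:2", "4:2"]  # level order taken from the text


def multiply_add(x, y, addend=0):
    """Return (x*y + addend) mod 2**106 and the row count after each level."""
    rows = partial_products(x, y)
    counts = [len(rows)]
    for kind in LEVELS:
        rows = compress_level(rows, kind)
        counts.append(len(rows))
    # Final 3:2 level merges the two remaining rows with the addend.
    s, c = csa32(rows[0], rows[1], addend & MASK)
    return (s + c) & MASK, counts


if __name__ == "__main__":
    rng = random.Random(0)
    x, y = rng.getrandbits(WIDTH), rng.getrandbits(WIDTH)
    result, counts = multiply_add(x, y)
    print(counts, result == x * y)
```

In this model the row counts per level come out as 27, 18, 12, 6, 4, 2, which agrees with the 27 Booth rows and the 12 rows latched at the start of cycle 2; how the remaining rows map onto the 8 folded rows at the start of cycle 3 is not captured here.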
Figure 20.3.5 shows exemplary 3:2 and 4:2 TG CMPs.

Input operand latches convert static inputs to clock-qualified signals for the domino stages. Booth encoders are placed in the central clock bay to minimize delay. Pulsed operand inputs to the dynamic stages reduce contention current at various process and operating corners. The design tolerates 10% variation in system clock pulses, i.e., a 40% evaluate or precharge duty cycle, thus enhancing technology and frequency scalability. Besides PFET keepers for dynamic nodes, clock-gated NFET keeper devices are incorporated to sustain the low state, allowing low-speed testing and operation under short evaluate-pulse conditions. Additionally, a pulse limiter on the clock grid limits evaluation time to 20FO4 at long cycle times. This avoids keeping dynamic nodes in the evaluate state for long periods, so higher leakage and smaller keepers can be tolerated without failure. Long Booth-encoder output wires and ladder-style Booth MUX input connections are shielded from noise. Dynamic output signals are converted to static ones with a mid-cycle converter latch whose input clock is delay inte...