Cryogenic, superconducting digital processors offer the promise of greatly reduced operating power for server-class computing systems. This is due to the exceptionally low energy per operation of Single Flux Quantum (SFQ) circuits built from Josephson junction devices operating at a temperature of 4 K. Unfortunately, no suitable same-temperature memory technology yet exists to complement these SFQ logic technologies. Possible memory technologies are in the early stages of development but will take years to reach the cost per bit and capacity of current semiconductor memory. We discuss the pros and cons of four alternative memory architectures that could be coupled to SFQ-based processors. Our feasibility studies indicate that cold memories built from CMOS DRAM and operating at 77 K can support superconducting processors at low cost per bit, and that they can do so today.
Rapidly evolving workloads and exploding data volumes place great pressure on data-center compute, IO, and memory performance, and especially on memory capacity. Increasing memory capacity requires a commensurate reduction in memory cost per bit. DRAM technology scaling has been steadily delivering affordable capacity increases, but DRAM scaling is rapidly reaching physical limits. Other technologies such as Flash, enhanced Flash, Phase Change Memory, and Spin Torque Transfer Magnetic RAM hold promise for creating high-capacity memories at lower cost per bit. However, these technologies have attributes that require careful management. We propose a hybrid DIMM architecture that uses a hardware-managed DRAM in front of enhanced Flash, which has much lower read latencies than conventional Flash. We explore the design space of such storage-class memory (SCM) devices in the context of different technology parameters, evaluating performance and endurance for data-center workloads. Our hybrid memory architecture is commercially realizable and can use standard DIMM form factors, giving it a low barrier to market entry. We find that for workloads like media streaming, enhanced Flash can be combined with DRAM to enable 88% of the performance of a DRAM-only system of the same capacity at 23% of the cost, even when factoring in replacement costs due to wear-out. The bottom line is that performance per unit cost is a factor of 3.8 better than DRAM.
1. INTRODUCTION
Data-center servers struggle to keep up with rapidly evolving workloads and exploding data volumes. For many Big Data workloads, DRAM capacity is as important as compute, IO, and
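The factor of 3.8 quoted in the abstract above follows directly from the two percentages given there. A minimal sketch of that arithmetic, assuming "cost per performance" means performance divided by cost relative to the DRAM-only baseline (the function and variable names are illustrative, not from the paper):

```python
# Figures quoted in the abstract for media-streaming workloads.
relative_performance = 0.88  # hybrid performance as a fraction of DRAM-only
relative_cost = 0.23         # hybrid cost as a fraction of DRAM-only


def perf_per_cost_gain(perf: float, cost: float) -> float:
    """Improvement in performance per unit cost relative to the baseline."""
    return perf / cost


gain = perf_per_cost_gain(relative_performance, relative_cost)
print(f"{gain:.1f}x")  # prints 3.8x
```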
Recently, we completed a set of monolithic floating point processors. The set is implemented in a CMOS on sapphire process with four-micron feature sizes. There are three processors in the set: an add/subtract chip, a multiply chip, and a divide chip.
The primary design goal for the chip set was maximum scalar (single operation) performance, which was achieved by using combinational data path logic. It was also necessary to perform three-chip partitioning so that each of the fundamentally different floating point operations could be optimized with its own data path. The chip set has about 20 to 30 times the performance of commercially available 64b floating-point processors.
A second design goal was to maximize the ease of use of the chip set. The chip set has the following features: identical pin assignments (64 pins/chip); simple control requirements; static, single-clock edge operation; TTL compatible inputs and outputs; about 400mW per chip; three 16b, 12MHz data buses; tristate output pins. The chip set handles the 32b and 64b floating point and 32b fixed point data types of a minicomputer*. Operations provided include add, subtract, multiply, divide, data type conversion, and N-bit shifting.
Figure 1 shows a photograph of the add/subtract chip. The two operands are loaded into the exponent and fraction registers in two or four clock cycles (for 32b or 64b operations). The exponents are compared, and the fraction with the smaller exponent is right shifted so the fractions are aligned. The fractions are added, and the result is normalized by a right shift (operands have the same sign) or by an N-bit left shift (different operand signs). The result exponent is corrected for the post normalization. The result fraction is then rounded to the proper precision, and the exponent checked for overflow or underflow. If either has occurred, the appropriate constants are forced. The result is unloaded in two or four clock cycles.
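The add/subtract sequence described above (compare exponents, align, add, normalize, round, check range) can be sketched in software. This is only an illustration under stated assumptions, not the chip's actual data path: it uses a hypothetical sign-magnitude toy format with an 8-bit fraction, two guard bits, round-half-up, and a made-up exponent range; the constants forced on overflow and underflow are likewise assumptions.

```python
FRAC_BITS = 8                # hypothetical fraction width (toy format)
GUARD = 2                    # hypothetical guard bits kept for rounding
EXP_MIN, EXP_MAX = -16, 15   # hypothetical exponent range


def to_float(x):
    """Value of a (sign, exponent, fraction) triple: (-1)^s * f * 2^e."""
    s, e, f = x
    return (-1) ** s * f * 2.0 ** e


def fp_add(a, b):
    """Add two normalized (sign, exponent, fraction) triples."""
    (sa, ea, fa), (sb, eb, fb) = a, b
    if fa == 0:
        return b
    if fb == 0:
        return a
    # 1. Compare exponents; make `a` the operand with the larger exponent.
    if eb > ea:
        (sa, ea, fa), (sb, eb, fb) = (sb, eb, fb), (sa, ea, fa)
    # 2. Align: extend with guard bits, right shift the smaller operand.
    fa <<= GUARD
    fb = (fb << GUARD) >> (ea - eb)
    exp = ea - GUARD
    # 3. Add the signed magnitudes.
    total = (-fa if sa else fa) + (-fb if sb else fb)
    sign, mag = (1, -total) if total < 0 else (0, total)
    if mag == 0:
        return (0, 0, 0)
    # 4. Normalize and correct the exponent.
    width = FRAC_BITS + GUARD
    while mag >= (1 << width):        # carry out (same signs): right shift
        mag >>= 1
        exp += 1
    while mag < (1 << (width - 1)):   # cancellation (different signs): left shift
        mag <<= 1
        exp -= 1
    # 5. Round to FRAC_BITS (round half up) and drop the guard bits.
    mag = (mag + (1 << (GUARD - 1))) >> GUARD
    exp += GUARD
    if mag == (1 << FRAC_BITS):       # rounding carried out of the fraction
        mag >>= 1
        exp += 1
    # 6. Range check; the forced constants here are assumptions.
    if exp > EXP_MAX:
        return (sign, EXP_MAX, (1 << FRAC_BITS) - 1)
    if exp < EXP_MIN:
        return (0, 0, 0)
    return (sign, exp, mag)


print(to_float(fp_add((0, 0, 192), (0, -2, 128))))  # 192 + 32
print(to_float(fp_add((0, 0, 192), (1, 0, 160))))   # 192 - 160
```

The two normalization loops mirror the distinction drawn in the text: same-sign operands can only produce a carry out, which a right shift repairs, while opposite-sign operands can cancel leading bits and need a left shift of variable length.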
It will be noted that the operand loading and result unloading are performed synchronously with respect to the external system. Propagation through the combinational data path logic is asynchronous, and the necessary time delay is provided by an integral number of system clock cycles. The propagation delay is in the range of 400 to 600ns, depending upon the operation.
A photograph of the multiply chip appears in Figure 2. When the operands are loaded, a modified Booth encoding is performed on the multiplier fraction. This reduces the number of full adders required in the combinational array by one half, and reduces the propagation delay of the array by the same factor. When the carry save result of the full adder array settles, it is sign corrected and converted to carry propagate form. The result fraction is normalized, and the result exponent (the sum of the operand exponents) is conditionally incremented. The fraction is rounded to the proper precision, and the exponent checked for overflow or underflow (with th...
*HP1000
Chairman: Peter J. Verhofstadt, Fairchild Camera/Instr. Corp., Santa Clara, CA
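The halving attributed to modified Booth encoding above comes from recoding the multiplier two bits at a time, so an n-bit multiplier yields n/2 partial products instead of n. A minimal sketch of the standard radix-4 recoding, assuming a two's-complement multiplier of even width (the excerpt does not state the chip's fraction representation, and the function names are illustrative):

```python
# Radix-4 Booth digit for each bit triplet (y[i+1], y[i], y[i-1]),
# indexed by the triplet's integer value: digit = -2*y[i+1] + y[i] + y[i-1].
BOOTH_TABLE = [0, 1, 1, 2, -2, -1, -1, 0]


def booth_radix4_digits(multiplier: int, bits: int) -> list:
    """Recode a `bits`-wide two's-complement multiplier into digits in
    {-2, -1, 0, 1, 2}, least significant digit first."""
    if bits <= 0 or bits % 2:
        raise ValueError("bits must be a positive even number")
    m = (multiplier & ((1 << bits) - 1)) << 1  # append the implicit y[-1] = 0
    return [BOOTH_TABLE[(m >> i) & 0b111] for i in range(0, bits, 2)]


def booth_multiply(multiplicand: int, multiplier: int, bits: int) -> int:
    """Sum the bits/2 partial products selected by the Booth digits."""
    digits = booth_radix4_digits(multiplier, bits)
    return sum(d * (multiplicand << (2 * k)) for k, d in enumerate(digits))


digits = booth_radix4_digits(110, 8)
print(digits)                      # [-2, 0, -1, 2]: four digits for eight bits
print(booth_multiply(7, 110, 8))   # 770
```

Each digit selects 0, ±1, or ±2 times the multiplicand, all of which are cheap to form by shifting and negating, so the adder array needs only half as many rows.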
A set of two VLSI circuits well suited for digital signal processing is described which provides the complete 32 bit floating-point multiplier and adder functions. The data format conforms with the new IEEE P754 standard. Operations include multiplication, add, subtract, conversion to and from 24 bit integer numbers, and absolute value.
Multiply and add times are both 600 nsec in a flow-through manner. This is reduced to 200 nsec when operated in a three-stage pipeline manner using internal registers. Both chips are fabricated in high-speed NMOS with 3 micron minimum feature size, which results in low power consumption of 1.5 watts typically. They are packaged in a 64-pin dual-in-line package and a 68-pin leadless chip carrier. The applications of this chip set in FFT, digital filtering and array processing are described in this paper.
* WTL1032/WTL1033
HARDWARE FOR FLOATING-POINT
Up until now, the digital signal processing system designer has had few attractive ways to use floating-point processing. Commercially available array processors are expensive. They are either of the coprocessor type, like the Intel 8087, which is too slow, or of the special-purpose processor type, which is bulky and power hungry. The WTL1032/1033 handle the floating-point data path components necessary for the 32 bit IEEE standard in high-density, low-power NMOS with the speed, architecture and control mechanisms needed for use in high-speed real-time signal processors. (1)
A NEW VLSI CHIP SET
A common block diagram for the two VLSI chips is shown in Figure 1. For the floating-point multiplier the array is a significand multiplier and exponent adder. For the floating-point ALU the array is a denormalizer, significand adder and re-normalizer. Both have the same register structure and pinouts so that loading and unloading of data, function, modes of operation and status are uniform.
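The 600 nsec flow-through and 200 nsec pipelined figures quoted above describe latency and issue interval respectively. A rough sketch of what that means for a stream of operations, assuming independent operations, no stalls, and a pipelined latency of three 200 nsec stages (these assumptions and the function names are illustrative, not taken from the paper):

```python
FLOW_THROUGH_NS = 600  # multiply/add time in flow-through mode (from the text)
PIPELINE_STAGES = 3    # three-stage pipeline (from the text)
STAGE_NS = 200         # per-result time in pipelined mode (from the text)


def flow_through_time(n_ops: int) -> int:
    """Total time when each operation must finish before the next starts."""
    return n_ops * FLOW_THROUGH_NS


def pipelined_time(n_ops: int) -> int:
    """Total time when a new operation enters the pipeline every stage time.

    Assumes the first result appears after PIPELINE_STAGES * STAGE_NS and
    each later result follows one stage time after the previous one.
    """
    if n_ops == 0:
        return 0
    return PIPELINE_STAGES * STAGE_NS + (n_ops - 1) * STAGE_NS


n = 1024
print(flow_through_time(n), pipelined_time(n))
print(flow_through_time(n) / pipelined_time(n))  # approaches 3 for large n
```

For a single operation both modes take 600 nsec under these assumptions; the benefit appears only when many independent operations are streamed, as in FFT or filtering inner loops.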
As the figure shows, the array can be divided into three additional stages separated by registers so that data can be pipelined through, in effect tripling the speed. The function codes are also pipelined such that the data flow can continue even when the function is changing.
Input and output
All inputs and outputs, both data and control, are fully registered on the chip. These, along with three-state outputs, make for ease of use in bus-oriented systems. The two 32 bit input operands and the output result are time multiplexed through 15 pins so that standard 64 and 68 pin packages could be used. Inputs and outputs can be clocked at twice the pipelined rate so this timesharing is invisible. The loading and unloading of the array are controlled by separate registers where the output, like the function control, is pipelined so it remains with the pertinent data. The input operands can be loaded individually, so if one remains constant, bus traffic is reduced.
Modes of operation
The IEEE standard specifies not only the floating-point format but the detection and treatment of exceptions (such as underflow) and choices in the procedure for rounding and representation of infinity. A...