To move High-Performance Computing (HPC) closer to forward operating environments and missions, the Army Research Laboratory is developing approaches using hybrid, asymmetric core computing. By blending the capabilities of Graphics Processing Units (GPUs) and traditional von Neumann multicore Central Processing Units (CPUs), these approaches are being optimized to provide real-time or near real-time processing speeds for research project applications. Algorithms are designed to partition work among the resources best suited to handle each part of the processing load. The use of commodity resources allows the design to remain flexible throughout the life cycle without the costly and time-consuming delays associated with Application-Specific Integrated Circuit (ASIC) development. This paradigm allows for rapid technology transfer to end users. In this paper, we describe a synchronous impulse reconstruction radar imaging algorithm that has been designed for hybrid CPU-GPU processing. We discuss various optimizations, such as asynchronous task partitioning between the CPU and GPU and data movement reduction. We also discuss the analysis and design of the algorithms within the context of two programming models: NVIDIA's CUDA and AMD's ATI Brook+. Finally, we report on the speedup achieved by this approach, which allowed us to take a code once restricted to postprocessing and transform it into one that exceeds real-time performance requirements.
A 1Mb x 1 ECL SRAM fabricated with a 0.8µm BiCMOS technology has an 8ns access time and is 10K I/O compatible. To achieve sub-10ns address access time and low power consumption, an ECL-CMOS level converter, a bit-line peripheral circuit and an automatic power saving function are employed.

The chip architecture is shown in Figure 1. Inputs are received by an ECL input buffer and translated to CMOS levels, and address decoding is executed. The cell array consists of 512 rows by 2048 columns, and is divided into 16 sections. Each section has 128 columns and four local amplifiers, allowing for conversion to a 4b-wide configuration. To reduce both the word-line delay and the active power, a modulated double word-line structure was adopted [1]. Only one section is activated at a time by a section word-line (SWL), which is selected through a NOR gate by a main word-line (MWL) and one of the four section selection lines. This structure relaxes the pitch of the main word-line driver and also relaxes the bipolar transistor size. The polysilicon section word-line is connected to the aluminum section word-line every 16 cells. The total word-line delay is less than 1ns. The 4b-wide global data are multiplexed and output by an ECL buffer.

Figure 2 shows the ECL input buffer and ECL-CMOS level converter. The output of this buffer is directly converted to CMOS level without ECL predecoding to reduce power consumption. The converter consists of an NMOS dual cross-coupled latch and two PMOS FETs, and a reference voltage of Vbb-Vbe-Vtp is applied to the PMOS gates to detect the input-buffer output levels. The complementary outputs Ai and Ai* are available simultaneously because of the symmetrical geometry of the converter. Thus, the converter is suitable for an address buffer. The output of the converter supplies CMOS levels with no dc current.

A BiCMOS bit-line peripheral circuit, illustrated in Figure 3, is used to minimize sense delay. The bit-line voltage swing is limited to about 50mV by a normally-on bit-line equalization circuit, in which the bit-line equalizing transistors are normally activated during a read cycle. Thus, bit-line recovery time during data switching is reduced. The access time advantage of using the normally-on bit-line equalization circuit is about 30%. During write operation the equalization transistors are cut off. The bit-line voltage of Vcc-2Vbe is generated by a Darlington transistor. A PMOS load is inserted between the bit-line voltage source and the bit-line pairs. A two-stage sensing circuit, with a bipolar differential pair, is used.

An automatic power saving (APS) function utilizing an address transition detection (ATD) technique is applied to the ECL SRAM in order to reduce power consumption during the read cycle. The cell array and first sense amplifiers are activated by a signal, ΦAPS, which is generated from the ATD pulse and is used only for activation, not for equilibration. Figure [?] shows a circuit diagram of a s...

[1] Sakurai, T., et al., "A Low Power 46ns 256Kbit CMOS Static RAM with Dynamic Double Word Line", IEEE ...
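The array organization quoted above (512 rows by 2048 columns, split into 16 sections of 128 columns) can be checked with a little arithmetic. The sketch below is only illustrative: the constants come from the text, but the `decode` function and its bit assignment (low bits select the column within a section, the next bits the section, the high bits the row) are assumptions made for this example and are not the address mapping of the actual chip.

```python
# Figures taken from the text.
ROWS = 512
COLUMNS = 2048
SECTIONS = 16

COLUMNS_PER_SECTION = COLUMNS // SECTIONS   # 128 columns per section
TOTAL_BITS = ROWS * COLUMNS                 # 1,048,576 cells = 1Mb x 1

# Address field widths implied by those sizes (all are powers of two).
COLUMN_BITS = COLUMNS_PER_SECTION.bit_length() - 1   # 7
SECTION_BITS = SECTIONS.bit_length() - 1             # 4
ROW_BITS = ROWS.bit_length() - 1                     # 9


def decode(address):
    """Split a 20-bit address into (row, section, column within section).

    Hypothetical bit assignment, chosen only for illustration.
    """
    if not 0 <= address < TOTAL_BITS:
        raise ValueError("address out of range for a 1Mb x 1 array")
    column = address & (COLUMNS_PER_SECTION - 1)
    section = (address >> COLUMN_BITS) & (SECTIONS - 1)
    row = address >> (COLUMN_BITS + SECTION_BITS)
    return row, section, column


if __name__ == "__main__":
    print(TOTAL_BITS)                             # 1048576
    print(ROW_BITS + SECTION_BITS + COLUMN_BITS)  # 20
    print(decode(TOTAL_BITS - 1))                 # (511, 15, 127)
```

The three field widths sum to 20 address bits, which matches a 1Mb x 1 organization. Under this assumed mapping, only the section holding the addressed column would need its section word-line driven.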