Media-processing applications, such as signal processing, 2D- and 3D-graphics rendering, and image and audio compression and decompression, are the dominant workloads in many systems today. The real-time constraints of media applications demand large amounts of absolute performance and high performance densities (performance per unit area and per unit power). Therefore, media-processing applications often use special-purpose (custom), fixed-function hardware. General-purpose solutions, such as programmable digital signal processors (DSPs), offer increased flexibility but achieve performance density levels two or three orders of magnitude worse than special-purpose systems.

One reason for this performance density gap is that conventional general-purpose architectures are poorly matched to the specific properties of media applications. These applications share three key characteristics. First, operations on one data element are largely independent of operations on other elements, resulting in a large amount of data parallelism and high latency tolerance. Second, there is little global data reuse. Finally, the applications are computationally intensive, often performing 100 to 200 arithmetic operations for each element read from off-chip memory.

Conventional general-purpose architectures don't efficiently exploit the available data parallelism in media applications. Their memory systems depend on caches, which are optimized to reduce latency and capture data reuse, a poor fit for applications with abundant latency tolerance and little reuse. They also don't scale to the numbers of arithmetic units or registers required to support a high ratio of computation to memory access. In contrast, special-purpose architectures take advantage of these characteristics because they effectively exploit data parallelism and computational intensity with a large number of arithmetic units.
Also, special-purpose processors directly map the algorithm's dataflow graph into hardware rather than relying on memory systems to capture locality.

Another reason for the performance density gap is the constraints of modern technology. Modern VLSI computing systems are limited by communication bandwidth rather than arithmetic. For example, in a contemporary 0.15-micron CMOS technology, a 32-bit integer adder requires less than 0.05 mm² of chip area. Hundreds to thousands of these arithmetic units fit on an inexpensive 1-cm² chip. The challenge is supplying them with instructions and data. General-purpose processors that rely on global structures such as large multiported register files to provide that data bandwidth do not scale to such numbers of arithmetic units.
Processor architectures with tens to hundreds of arithmetic units are emerging to handle media processing applications. These applications, such as image coding, image synthesis, and image understanding, require arithmetic rates of up to 10¹¹ operations per second. As the number of arithmetic units in a processor increases to meet these demands, register storage and communication between the arithmetic units dominate the area, delay, and power of the arithmetic units. In this paper we show that partitioning the register file along three axes reduces the cost of register storage and communication without significantly impacting performance. We develop a taxonomy of register architectures by partitioning across the data-parallel, instruction-level parallel, and memory hierarchy axes, and by optimizing the hierarchical register organization to operate on streams of data. Compared to a centralized global register file, the most compact of these organizations reduces the register file area, delay, and power dissipation of a media processor by factors of 195, 20, and 430, respectively. This reduction in cost is achieved with a performance degradation of only 8% on a representative set of media processing benchmarks.

¹ This assumes that the register file is backed by a cache with a hit time of one cycle. If the register file must cover a memory (or cache) latency of more than a few cycles, then the register file always dominates the area of the ALUs, dominates latency for more than 22 ALUs, and dominates power dissipation for more than 2 ALUs.
Media applications are characterized by large amounts of available parallelism, little data reuse, and a high computation-to-memory-access ratio. While these characteristics are poorly matched to conventional microprocessor architectures, they are a good fit for modern VLSI technology with its high arithmetic capacity but limited global bandwidth. The stream programming model, in which an application is coded as streams of data records passing through computation kernels, exposes both parallelism and locality in media applications that can be exploited by VLSI architectures. The Imagine architecture supports the stream programming model by providing a bandwidth hierarchy tailored to the demands of media applications. Compared to a conventional scalar processor, Imagine reduces the global register and memory bandwidth required by typical applications by factors of 13 and 21, respectively. This bandwidth efficiency enables a single-chip Imagine processor to achieve a peak performance of 16.2 GFLOPS (single-precision floating point) and sustained performance of up to 8.5 GFLOPS on media processing kernels.