Abstract-Instruction memoization is a promising technique for reducing the power consumption and increasing the performance of future low-end/mobile multimedia systems. Power and performance efficiency can be improved by reusing instances of an already executed operation. Unfortunately, this technique may not always be worth the effort, because of the power consumption and area impact of the tables required to achieve an adequate level of reuse. In this paper, we introduce and evaluate a novel way of understanding multimedia floating-point operations based on the fuzzy computation paradigm: performance and power consumption can be improved at the cost of small precision losses in computation. By exploiting this implicit characteristic of multimedia applications, we propose a new technique called tolerant memoization. This technique extends classic memoization by mapping entries with similar inputs to the same output. We evaluate this new technique by measuring the effect of tolerant memoization on floating-point operations in a low-power multimedia processor, and we discuss the trade-offs between performance and the quality of the media outputs. We report energy improvements of 12 percent for a set of key multimedia applications with a small 6-Kbyte LUT, compared to 3 percent obtained using previously proposed techniques.
Index Terms-Low-power design, special-purpose and application-based systems, real-time and embedded systems.
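The core idea of tolerant memoization can be illustrated in software (a minimal sketch under our own assumptions, not the paper's hardware design; the class, its parameters, and the quantization step are hypothetical): operands are quantized by discarding low-order mantissa bits, so inputs that differ only slightly map to the same table entry and reuse a previously computed result.

```python
import struct

def quantize(x: float, drop_bits: int = 12) -> int:
    """Map a float to a reuse key by discarding low-order mantissa bits,
    so nearby inputs collide onto the same table entry."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits >> drop_bits

class TolerantMemo:
    def __init__(self, entries: int = 512):
        self.entries = entries   # bounded size, like a hardware LUT
        self.table = {}          # key -> cached result

    def compute(self, op, *args):
        key = (op.__name__,) + tuple(quantize(a) for a in args)
        if key in self.table:
            return self.table[key]          # reuse: skip the FP operation
        result = op(*args)
        if len(self.table) < self.entries:  # drop insertions once the table is full
            self.table[key] = result
        return result
```

In this toy model, a second call whose argument differs from a cached one by less than the dropped-precision threshold returns the cached result, trading a small precision loss for skipping the computation, which mirrors the paper's trade-off between energy and output quality.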
The focus of this paper is on adding a vector unit to a superscalar core as a way to scale current state-of-the-art superscalar processors. The proposed architecture has a vector register file that shares functional units with both the integer datapath and the floating-point datapath. A key point in our proposal is the design of a high-performance cache interface that delivers high bandwidth to the vector unit at low cost and low latency. We propose a double-banked cache with alignment circuitry to serve vector accesses, and we study two cache hierarchies: one feeds the vector unit from the L1; the other from the L2. Our results show that large IPC values (higher than 10 in some cases) can be achieved. Moreover, scaling our architecture simply requires adding functional units, without requiring more issue bandwidth. As a consequence, the proposed vector unit achieves high performance for numerical and multimedia codes with minimal impact on the cycle time of the processor or on the performance of integer codes.
Abstract-In this paper we address the design of a packet buffer for future high-speed routers that support line rates as high as OC-3072 (160 Gb/s) and a large number of ports and service classes. We describe a general design for hybrid DRAM/SRAM packet buffers that exploits the bank organization of DRAM. This general scheme includes some previously proposed designs as particular cases. Based on this general scheme, we propose a new scheme that randomly chooses a DRAM memory bank for every transfer between SRAM and DRAM. The numerical results show that this scheme would require an SRAM size almost an order of magnitude smaller than previously proposed schemes, without the problem of memory fragmentation.
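The random-bank idea can be sketched as a toy software model (a sketch under our own assumptions; the class, the bank count, and the batch size are hypothetical, and real designs operate at line rate in hardware): arriving packets accumulate in an SRAM tail buffer, and each bulk transfer to DRAM picks its destination bank uniformly at random, so no arrival pattern can repeatedly target the same bank.

```python
import random
from collections import deque

NUM_BANKS = 16  # hypothetical DRAM bank count

class HybridBuffer:
    """Toy model of a hybrid SRAM/DRAM packet buffer: packets arrive
    into SRAM, and each batch transfer to DRAM chooses a bank at random."""
    def __init__(self, batch: int = 4, seed: int = 0):
        self.batch = batch                 # packets moved per SRAM->DRAM transfer
        self.rng = random.Random(seed)
        self.sram_tail = deque()           # arriving packets held in SRAM
        self.dram_banks = [deque() for _ in range(NUM_BANKS)]

    def enqueue(self, pkt):
        self.sram_tail.append(pkt)
        if len(self.sram_tail) >= self.batch:
            bank = self.rng.randrange(NUM_BANKS)   # random bank per transfer
            for _ in range(self.batch):
                self.dram_banks[bank].append(self.sram_tail.popleft())
```

Because the bank is drawn independently for every transfer, the expected SRAM occupancy stays small regardless of the traffic pattern, which is the property behind the order-of-magnitude SRAM reduction claimed above.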
Recent years have shown an interesting evolution in the mid-end to low-end embedded domain. Portable systems are growing in importance as they improve in storage capacity and in their ability to interact with general-purpose systems. Furthermore, media processing is changing the way embedded processors are designed, keeping in mind the emergence of new application domains such as PDA systems or the third generation of mobile digital phones (UMTS). The performance requirements of these new kinds of devices are not those of the general-purpose domain, where traditionally the premium goal is the highest performance. Embedded systems must face ever-increasing real-time requirements as well as power consumption constraints. Under this special scenario, instruction/region reuse arises as a promising way of increasing the performance of embedded media processors and, at the same time, reducing power consumption. Furthermore, media and signal processing applications are a suitable target for instruction/region reuse, given the large amount of redundancy found in media data working sets. In this paper we propose a novel region reuse mechanism that takes advantage of the tolerance of media algorithms to losses in the precision of computation. By identifying regions of code where an input data set is processed into an output data set, we can reuse computational instances, using the result of previous ones with a similar input data set (hence the term tolerant reuse). We will show that conventional region reuse is barely able to provide more than an 8% reduction in executed instructions (even with significantly big tables) in a typical JPEG encoder application.
On the other hand, when applying the concept of tolerance, we are able to provide a reduction of more than 25% in the number of executed instructions with tables smaller than 1 KB (with only small degradations in the quality of the output image), and up to a 40% reduction (with no visually perceptible differences) with bigger tables.
Abstract-In order to support the enormous growth of the Internet, innovative research in every router subsystem is needed. In this paper we focus our attention on packet buffer design for routers supporting high-speed line rates. More specifically, we address the design of packet buffers using the Virtual Output Queuing (VOQ) discipline, which is used in most modern router architectures. The design is based on a previously proposed scheme that uses a combination of SRAM and DRAM modules. We propose a storage scheme that achieves a conflict-free memory bank organization. This leads to a reduction of the granularity of DRAM accesses, resulting in a decrease in the storage capacity needed by the SRAM. In the DRAM/SRAM scheme, SRAM memory bandwidth needs to match the line rate. Since memory bandwidth is limited by memory size, searching for memory schemes with a small SRAM size arises as an essential issue for high-speed line rates (e.g. OC-768, 40 Gbps, and OC-3072, 160 Gbps).
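A conflict-free bank organization can be illustrated with a simple interleaved mapping (an illustrative scheme under our own assumptions, not necessarily the exact mapping of the scheme above; the bank count and function are hypothetical): if consecutive blocks of a queue are striped round-robin across banks, back-to-back accesses to any one queue never hit the same bank twice in a row.

```python
NUM_BANKS = 8  # hypothetical DRAM bank count

def bank_for(queue_id: int, block_index: int) -> int:
    """Stripe consecutive blocks of a VOQ round-robin across banks,
    so a burst of accesses to one queue spreads over all banks."""
    return (queue_id + block_index) % NUM_BANKS

# Any window of NUM_BANKS consecutive blocks of one queue covers every bank.
banks = [bank_for(queue_id=3, block_index=k) for k in range(NUM_BANKS)]
```

Because consecutive accesses to a queue are guaranteed to land in distinct banks, each bank has the full stripe period to complete its slow DRAM access, which is what allows the SRAM staging area to shrink.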