Multithreading and multi-core processing have been shown to be powerful approaches for boosting system performance by taking advantage of parallelism in applications. This paper presents a processor design that unifies RISC and multithreading DSP for sophisticated multimedia applications with advanced standards such as H.264. The proposed design not only minimizes integration costs for embedded multithreading/multi-core design through independent coherent threads, but also reduces memory bandwidth requirements through a one-stop streaming buffer and a very fast data exchange mechanism. With the proposed techniques and an appropriate programming model, we can achieve a 78% reduction in memory bandwidth and an 89% reduction in processing time in H.264 video encoding, compared to a traditional single-stream microprocessor.
Multithreading and multicore processing are powerful ways to exploit parallelism in applications and thereby boost a system's performance. However, exposing sufficient parallelism and achieving data locality with low communication overhead remain important research issues in embedded multithreading/multicore design. This paper introduces the design of a fast data switching mechanism between multilevel storage structures in a new multicore architecture, and makes several contributions to the development of contemporary sophisticated multimedia applications with advanced standards such as H.264. The first contribution, collaborative multithreading, tightly unifies a reduced instruction set computer (RISC) and a collaborative multithreading digital signal processor (DSP) in order to exploit high parallelism and provide sufficient computing power to applications. Each collaborative thread of our DSP is built from a heterogeneous simultaneous multithreading, single instruction, multiple data (SIMD) structure and four media processing cores, which are connected by a fast switch that provides a fast data exchange mechanism among correlated streams on a thread-level basis. Our second contribution is one-stop streaming processing, which aims to keep data in the system until it is no longer needed, thus making data access more efficient. Our third contribution is a chunk threading programming model, including a thread management library and threading communication directives for reducing data communication and synchronization overhead. Through a combination of coarse-grained and fine-grained threading, programmers can choose among various threading levels based on the amount of data exchange in a program.
With our proposed techniques and an appropriate programming model, we can reduce processing time by 54.9% in H.264 video encoding (common intermediate format video at 16.574 f/s) with the 1-virtual independent and streaming processing by open collaborative multithreading configuration, compared to the Texas Instruments C62 core, which has eight functional units. We realize our design as a prototype chip fabricated in the Taiwan Semiconductor Manufacturing Company Ltd. 0.13 µm process. The die size of the processor core is 16.12 mm², including 414k logic transistors and 34.4 kB of on-chip static random access memory.
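As a reading aid, the reported reduction in processing time can be restated as a speedup factor. The symbols below (baseline time, new time, speedup) are introduced here for illustration only and do not appear in the original text; the calculation assumes the 54.9% figure is measured against the same workload on the baseline core.

```latex
% Assumed notation: T_base is the baseline (TI C62) processing time,
% T_new is the processing time of the proposed design, S is the speedup.
\[
  T_{\text{new}} = (1 - 0.549)\, T_{\text{base}} = 0.451\, T_{\text{base}},
  \qquad
  S = \frac{T_{\text{base}}}{T_{\text{new}}} = \frac{1}{0.451} \approx 2.22
\]
```

Under that assumption, a 54.9% reduction in processing time corresponds to roughly a 2.2-fold speedup over the baseline.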