Asymptotic limits of video signal processing architectures

Dutta, Santanu; Wolf, Wayne

doi:10.1109/76.475897

Cited by 5 publications

(1 citation statement)

References 38 publications

“…(2) Similarly, the data forwarding involved in clock c3 line and data bus d2 has more timing margin, due to longer wire-lengths, than that for previous c1 and d1. (3) Between the two clocks c2 and c4, c4 has a longer delay than c2, due to longer physical wire for c3, which is about half of the chip width. Thus c4 needs to be fed to the preceding unit and c2 is terminated, as previously described for pipeline fork and join connection method. [Figure 15: Subband filterbank chip layout] (4) The data forwarding by the d3 data bus and c4 clock (to the line memory unit I) has timing margin realized through wire delays. The data forwarding by d3 data bus and c2 clock (to the line memory unit II) has more timing margin than that due to temporally advanced clock c2. There are two other noteworthy things about the design: the double frequency clock (2f clk) feeding to a particular location of the chip and 12-bit connections to another chip. The 2f clk needs timing adjustment to align the clock with the c5 clock.…”

Section: A Practical Assessment of C² Pipelining (mentioning)

confidence: 99%

Timing constraints for high-speed counterflow-clocked pipelining

Yoo

Gopalakrishnan

Smith

1999

IEEE Trans. VLSI Syst.

With the escalation of clock frequencies and the increasing ratio of wire-to-gate delays, clock skew is a major problem to be overcome in tomorrow's high-speed VLSI chips. Also, with an increasing number of stages switching simultaneously comes the problem of higher peak power consumption. In our past work, we have proposed a novel scheme called Counterflow-Clocked (C²) Pipelining to combat these problems, and discussed methods for composing C² pipelined stages. In this paper, we analyze, in great detail, the timing constraints to be obeyed in designing basic C² pipelined stages as well as in composing C² pipelined stages. C² pipelining is well suited for systems that exhibit mostly uni-directional data flows as well as possess mostly nearest-neighbor connections. We illustrate C² pipelining on such a design with several design examples. C² pipelining eases the distribution of high-speed clocks, shortens the clock period by eliminating global clock signals, allows natural use of level-sensitive dynamic latches, and generates less internal switching noise due to the uniformly distributed latch operation. By applying C² pipelining and its composition methods to build a system, VLSI designers can substitute the global clock skew problem with many local one-sided delay constraints.

With the escalation of clock frequencies and the increasing ratio of wire-to-gate delays, clock skew is a major problem to be overcome in today's high-speed VLSI chips. Clock skew should ideally be less than 5-10% of the system clock cycle time [1]; this is a difficult figure to attain in many modern chips [2] and will become more so with the impending GHz rate of clocking [3]. The effect of shrinking VLSI feature sizes will increase this disparity [4] in the future, especially in the light of the fact that in submicron CMOS, interconnection delays are going to be larger than gate-propagation delays [5]. Consequently, an increased percentage of the clock period will be devoted to clock skew margins [6, 7]. The faster the clock and the bigger the die size, the worse the clock skew effects will be.

A major concern when building high performance VLSI systems is to build an effective clock distribution network. Many clock distribution methods for large high-speed VLSI chips have been developed [1] to achieve rigid synchronization (tight skew control) over the chip. Clock distribution networks of high-speed systems are normally comprised of binary trees of clock buffers [2, 8], which are expensive to produce in terms of area and design time. Network implementations such as H-tree methods [7] have been commonly exploited to reduce the clock skew. The effort to limit skews has an unfortunate side-effect: it causes the latches to switch almost simultaneously, causing ground-bounce and power-supply-droop, both of which can lead to chip malfunction. This often necessitates on-chip and off-chip decoupling capacitors [1], both of which add to the design cost.

Rigidly clocked synchronous systems are often those that support a variety of data movements b...

A methodology to evaluate memory architecture design tradeoffs for video signal processors

Dutta

Wolf

Wolfe

1998

IEEE Trans. Circuits Syst. Video Technol.

Self Cite

This paper develops a methodology for the design of the memory and the memory-processor communication network in video signal processors. The memory subsystem is the bottleneck of most video computing systems and its design requires evaluating tradeoffs between area, cycle time, and utilization. We emphasize the need to consider technological and circuit-level issues during the design of a system architecture, particularly video signal processing (VSP) systems, and present a systematic method whereby the organization of the memory architecture (the granularity of memory partitioning and the size and type of interconnection network) can be analyzed and its cycle-time approximated before a detailed design is undertaken. We show how variations in sizes and circuit configurations help determine the variations in delay of both memory and network, and how the delay curves, thus determined, can be used to design, compare, and choose from different memory-system architectures; we also describe a technique that can be used to identify the on-chip/off-chip boundary with respect to a hierarchical memory-system design for a memory-intensive VSP module. All of our results are validated via layout and simulation of prototype circuits in two different process technologies. Motion estimation and discrete cosine transform (DCT) being two of the most important tasks in video processing, we use the design of a motion estimator and that of a DCT unit as examples to illustrate the high-level issues in designing the memory architecture for a VSP module. The analysis presented for the motion estimator and the DCT unit can also be applied to other processing blocks belonging to the system.

Index Terms: Circuit simulation, hierarchical memory architecture, memory bank conflict, multiport memory, multistage interconnection network.
