JPEG 2000 uses two key components, the discrete wavelet transform (DWT) and embedded block coding with optimized truncation (EBCOT), to achieve excellent coding efficiency and numerous features, such as scalability and region of interest. The scalability comes from the multi-level decompositions of the DWT and the embedded block coding (EBC) of the EBCOT.There are three challenges in the design of an efficient JPEG 2000 codec for HD video. Firstly, the large data rate between the DWT and the EBCOT requires either large on-chip SRAM or high SDRAM bandwidth. Secondly, complicated control and irregular dataflow of the DWT and the EBC require large implementation area to meet the high throughput requirements. Thirdly, hardware sharing between the encoder and the decoder is difficult due to different computational characteristics and dataflow. This paper presents efficient techniques to overcome the above challenges. Figure 22.2.1 shows the features of the 1920×1080 motion-JPEG 2000 codec chip we have developed. The core size is 20.1mm 2 with 0.18µm CMOS technology. It contains 1155k logic gates and 19.9kB of SRAM. The power consumption is 385mW at 1.8V and 42MHz. Figure 22.2.2 shows the block diagram of the codec chip. It contains a Main Controller, a 3-level DWT module, three embedded block coding (EBC) modules, a rate-distortion optimization (RDO) controller, and a bit stream controller (BSC). The RDO controller maximizes image quality at a given target bit rate. Both the DWT and the EBC are pixel-pipelined so that no tile memory is required between the DWT and the EBC. Moreover, both the encoding and the decoding are one-pass, that is, no coefficient transmission to or from the SDRAM is required.There are two problems that make a pixel-pipelined architecture challenging. Firstly, the dataflow patterns of the DWT and the EBC are quite different; the DWT generates the coefficients in a subband-interleaving manner while the EBC encodes a codeblock within one subband at a time. Secondly, the DWT is a wordlevel algorithm while the EBC is a bit-level one. Therefore, a tilelevel pipeline schedule is used in all previous known works. This introduces either high bandwidth for those storing tiles in the SDRAM [1] or high cost for those storing in on-chip SRAM [2]. In [1], the bandwidth requirement is so high that two buses are needed, and each operates at twice the frequency of the core. In [2], a 192kB SRAM is required for a 256×256 tile, which would occupy half of the silicon area.A level-switched schedule is proposed to solve the above two problems. It eliminates the 192kB on-chip SRAM for the architecture in [2] and reduces the 310MB/s SDRAM bandwidth for the architecture in [1]. The detailed encoding schedule for both the DWT and the EBC is shown in Fig. 22.2.3. Each computation state requires 256 cycles for encoding either one stripe of a 64×64 codeblock or two stripes of a 32×32 code-block. The tile memory is eliminated by encoding the stripes in multiple subbands/levels in an interleaved manner. Three EBC modules ma...