A high-throughput, area-efficient hardware accelerator for adaptive deblocking filter in H.264/AVC

Nadeem, Muhammad; Wong, Stephan; Kuzmanov, Georgi; Shabbir, Asghar

doi:10.1109/estmed.2009.5336814

Cited by 8 publications

(8 citation statements)

References 18 publications

“…There exist numerous applications for accelerators in both the embedded and high-performance computing markets. Examples include video processing [24], software-defined radio [5], network traffic management [19], DNA computing [17] and fully programmable hardware acceleration platforms [23]. Efficient sharing of data in a heterogeneous MPSoC which contains different types of integrated computational elements is a challenging task.…”

Section: Introduction (mentioning)

confidence: 99%

Energy and performance exploration of accelerator coherency port using Xilinx ZYNQ

Sadri, Weis, Wehn, et al. 2013

Proceedings of the 10th FPGAworld Conference

Cooperation between a CPU and a hardware accelerator on computationally intensive tasks provides significant advantages in run-time speed and energy. Efficient management of data sharing among multiple computational kernels can rapidly turn into a complicated problem. The Accelerator Coherency Port (ACP) emerges as a possible solution by enabling hardware accelerators to issue coherent accesses to the memory space. In this paper, we quantify the advantages of using the ACP over the traditional method of sharing data through DRAM. We select the Xilinx ZYNQ as the target and develop an infrastructure to stress the ACP and high-performance (HP) AXI interfaces of the ZYNQ device. Hardware accelerators on both the HP and ACP AXI interfaces reach a full-duplex data processing bandwidth of over 1.6 GBytes/s running at 125 MHz on an XC7Z020-1C device. The effect of background DRAM and cache traffic on the performance of the accelerators is analyzed. For a sample image filtering task, the cooperative operation of CPU and ACP accelerator (CPU-ACP) gains a speed-up of 1.2X over CPU and HP acceleration (CPU-HP). In terms of energy efficiency, an improvement of 2.5 nJ (> 20%) is shown for each byte of processed data. This is the first work to present detailed practical comparisons of the speed and energy efficiency of various processor-accelerator memory sharing techniques in a configurable heterogeneous platform.

“…We proposed a novel decomposition of the filter kernels to remove the arithmetic operations redundancy in our previous work [17]. The proposed optimization of the filter equations reduces the total number of adder instances from 49 to 24 [17]. This more than double reduction of addition operations does not only pay off in terms of less area requirement for its implementation but also helps to reduce the signal activity in the combinatorial logic between different pipeline stages.…”

Section: Our Approach To Reduce the Dynamic Power Consumption (mentioning)

confidence: 96%

“…(14) - (16) in Fig 3(c). Similarly, the overlapped data path for conditional filtering in strong and weak filtering modes is implemented in LumaCommonPBlock and LumaCommonQBlock whereas the LumaBs4_PBlock, LumaBs4_QBlock implements the rest of the processing for strong filter mode case [17]. In case of Strong or Weak Filter Modes for chroma component of the MB, one can see from Fig.…”

Section: Deblock Filter Core Module (mentioning)

confidence: 99%

“…In the original filter equations in H.264/AVC video coding standard [1], 49 addition operations along with 5 clip operations are required in total for full pipeline implementation of the algorithm. We proposed a novel decomposition of the filter kernels to remove the arithmetic operations redundancy in our previous work [17]. The proposed optimization of the filter equations reduces the total number of adder instances from 49 to 24 [17].…”

Section: Our Approach To Reduce the Dynamic Power Consumption (mentioning)

confidence: 99%
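
The statements above describe removing redundant additions by decomposing the filter kernels. The following Python sketch illustrates the general idea of sharing partial sums, using the standard H.264/AVC strong-filter (bS = 4) luma equations for the P side of an edge. It is only an illustration: the function names are hypothetical, the filter-mode conditions are omitted, and this is not claimed to be the decomposition proposed in [17] or to reproduce its adder counts.

```python
def strong_filter_p_direct(p3, p2, p1, p0, q0, q1):
    """P-side strong-filter outputs, each equation evaluated term by term."""
    p0_new = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3
    p1_new = (p2 + p1 + p0 + q0 + 2) >> 2
    p2_new = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3
    return p0_new, p1_new, p2_new


def strong_filter_p_shared(p3, p2, p1, p0, q0, q1):
    """Same outputs, but partial sums common to several equations are computed once."""
    t = p1 + p0 + q0   # appears in all three equations
    s = t + p2         # p2 + p1 + p0 + q0, appears in p1' and p2'
    # Multiplications by two become shifts, which cost only wiring in hardware.
    p0_new = ((t << 1) + p2 + q1 + 4) >> 3
    p1_new = (s + 2) >> 2
    p2_new = (((p3 + p2) << 1) + s + 4) >> 3
    return p0_new, p1_new, p2_new


if __name__ == "__main__":
    # Spot-check that both forms agree on a coarse grid of 8-bit pixel values.
    grid = range(0, 256, 51)
    for p3 in grid:
        for p2 in grid:
            for p1 in grid:
                for p0 in grid:
                    for q0 in grid:
                        for q1 in grid:
                            args = (p3, p2, p1, p0, q0, q1)
                            assert strong_filter_p_direct(*args) == strong_filter_p_shared(*args)
    print("direct and shared forms agree")
```

In a hardware datapath, each reused partial sum (`t`, `s`) corresponds to one adder chain whose result feeds several outputs, which is the kind of sharing that reduces both area and switching activity.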

Low-power, high-throughput deblocking filter for H.264/AVC

Wong, Kuzmanov, Shabbir, et al. 2010

2010 International Symposium on System on Chip

Self Cite

In this paper, we present a low-power, high-throughput hardware implementation of a deblocking filter core in H.264/AVC for battery-powered multimedia electronic devices. The hardware implementation is based on an optimized deblocking filter algorithm with 50% fewer addition operations. Full or partial filtering-skip scenarios are evaluated at an early stage in the filter processing chain to avoid unnecessary operations. Moreover, independent processing blocks are identified and implemented with gated clocks. An efficient control block that activates and deactivates these independent processing blocks dynamically, together with a pipelined implementation, enables us to achieve low power on the one hand and a high-throughput deblocking filter design on the other. Experimental results suggest that the dynamic power consumption is reduced by up to 50% when compared with state-of-the-art designs in the literature. The deblocking filter core consumes 43 mW dynamic power on a Xilinx Virtex II FPGA and consumes 16.36 μW when synthesized using a 0.18 μm CMOS standard cell library. The FPGA implementation on Virtex II can work at 76 MHz, whereas the maximum operating frequency for the 0.18 μm process technology is 200 MHz. Our deblocking filter hardware implementation can easily provide real-time filtering operation for full-HD video format (1920×1080) @ 30 fps with an operating frequency as low as 59 MHz.
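
To put the 59 MHz figure in context, the short Python sketch below estimates how many clock cycles a filter has available per macroblock at a given frame size, frame rate and clock frequency. It assumes 16×16 macroblocks with the frame height padded up to a whole number of macroblocks; the function names are hypothetical and the calculation is a back-of-the-envelope check, not a figure taken from the paper.

```python
def macroblocks_per_frame(width, height, mb_size=16):
    """Number of macroblocks per frame, rounding each dimension up to a whole macroblock."""
    mbs_wide = (width + mb_size - 1) // mb_size
    mbs_high = (height + mb_size - 1) // mb_size
    return mbs_wide * mbs_high


def cycle_budget_per_mb(clock_hz, width, height, fps):
    """Clock cycles available per macroblock for real-time operation."""
    mb_rate = macroblocks_per_frame(width, height) * fps  # macroblocks per second
    return clock_hz / mb_rate


if __name__ == "__main__":
    mbs = macroblocks_per_frame(1920, 1080)          # 120 x 68 = 8160 macroblocks
    budget = cycle_budget_per_mb(59e6, 1920, 1080, 30)
    print(f"{mbs} MB/frame, about {budget:.0f} cycles per macroblock at 59 MHz")
```

Under these assumptions, full-HD at 30 fps corresponds to 244,800 macroblocks per second, so a 59 MHz clock leaves roughly 241 cycles per macroblock.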

“…A large number of accelerators utilize a single filter core, such as [4][5][6][7][8][9][10][11][12][13][14]. Most single filter based architecture [4][5][6] can operate at 100MHz and take around 200 cycles per macroblock, which can not satisfy high level requirement.…”

Section: Introduction (mentioning)

confidence: 99%

A high-throughput low-power hardware architecture for H.264 deblocking filter

Chen, Xia 2010

2010 2nd International Conference on Computer Engineering and Technology

In this paper we present a high-throughput, low-power hardware architecture of a deblocking filter for H.264/AVC. In order to enhance throughput, we propose a five-stage pipelined filter core and a novel double-filter architecture to process vertical and horizontal edges simultaneously. A novel parallel filtering order is adopted not only to eliminate structural hazards but also to efficiently reuse the intermediate data and reduce SRAM access times. In addition, our architecture utilizes clock gating schemes for both filter cores and transposes to further reduce power consumption. While working at a clock frequency of 150 MHz, synthesized under 0.13 μm CMOS standard cell technology, our design achieves a throughput of 1562 kMB/s, which could easily meet the throughput requirement of all the levels in the H.264/AVC video coding standard, and a power consumption of 0.6 μW per macroblock, which is suitable for mobile applications.
