Advances in nanoelectronic fabrication have enabled integrated circuits to operate at a high frequency. The finite impulse response (FIR) filter needs only to meet real-time demand. Accordingly, increasing the FIR architecture's folding number can compensate the high-frequency operation and reduce the hardware complexity, while continuing to allow applications to operate in real time. In this work, the folding scheme with integrating input-data and tap folding is proposed to develop a hardware-efficient programmable FIR architecture. With the use of the radix-4 Booth algorithm, the 2-bit input subdata approach replaces the conventional 3-bit input subdata approach to reduce the number of latches required to store input subdata in the proposed FIR architecture. Additionally, the tree accumulation approach with simplified carry-in bit processing is developed to minimize the hardware complexity of the accumulation path. With folding in input data and taps, and reduction in hardware complexity of the input subdata latches and accumulation path, the proposed FIR architecture is demonstrated to have a low hardware complexity. By using the TSMC 0.18 µm CMOS technology, the proposed FIR processor with 10-bit input data and filter coefficient enables a 128-tap FIR filter to be performed, which takes an area of 0.45 mm 2 , and yields a throughput rate of 20 M samples per second at 200 MHz. As compared to the conventional FIR processors, the proposed programmable FIR processor not only meets the throughput-rate demand but also has the lowest area occupied per tap. Hardware complexity Real time Computational performance Conventional architectures with fixed folding number Architectures with capability of increasing folding number Programmable processors Circuit speed