Instruction scheduling heuristic for an efficient FFT in VLIW processors with balanced resource usage

Bahtat, Mounir; Belkouch, Saïd; Elleaume, Philippe; Gall, Philippe

doi:10.1186/s13634-016-0336-0

Cited by 9 publications

(5 citation statements)

References 33 publications

(36 reference statements)

“…However, thanks to pipelining, the CPPR2B of our design is very close to 1 when performing NTT incessantly. When four NTTs are performed successively, CPPR2B of our design is 1.27, which is better than the 1.4 of the VLIW processor [21]. Our design has an advantage in applications which need to perform NTT successively, such as CRYSTALS-Dilithium.…”

Section: CPPR2B (mentioning)

confidence: 91%

High-Throughput Polynomial Multiplier Architecture for Lattice-Based Cryptography

Shimada

Ikeda

2021

2021 IEEE International Symposium on Circuits and Systems (ISCAS)

We propose a polynomial multiplier for lattice-based cryptography that achieves a throughput 24.2 times higher than the state-of-the-art design. We have optimized the proposed architecture for ASIC implementation rather than FPGA or CPU implementation. We employ shift registers to reorder values and avoid complex memory accesses, and we realize fully pipelined operation for higher throughput. We also show that raising the degree of parallelism in this design increases throughput per area. This work will help accelerate Ring-LWE and Module-LWE-based cryptography, which attracts much attention for its resistance to quantum computers and its applications in fully homomorphic encryption (FHE).

“…Bahtat proposed the efficient scheduling way for FFT in VLIW processors [21], and measured his design by the number of cycles per pseudo radix-2 butterfly (CPPR2B). In our 2stream design, Thus, the proposed design focuses on throughput instead of latency as compared to the previous designs.…”

Section: Comparison With Other Designs (mentioning)

confidence: 99%

High-Throughput Polynomial Multiplier Architecture for Lattice-Based Cryptography

Shimada

Ikeda

2021

2021 IEEE International Symposium on Circuits and Systems (ISCAS)

“…The second line of research focused on task scheduling within the conventional FFT implementations to speed up computation over specialized hardware. For example, [14] improved the butterfly task scheduling in a very-long-instruction-word (VLIW) digital signal processors (DSP) chip using a software pipelining technique called modulo scheduling. This scheduling algorithm exploits the instruction-level parallelism (ILP) feature in the VLIW DSP platform to schedule multiple loop iterations in an overlapping manner [15].…”

Section: B. Related Work (mentioning)

confidence: 99%

Efficient FFT Computation in IFDMA Transceivers

Du

Liew

Shao

2022

Preprint

Interleaved Frequency Division Multiple Access (IFDMA) has the salient advantage of lower Peak-to-Average Power Ratio (PAPR) than its competitors like Orthogonal FDMA (OFDMA). A recent research effort put forth a new IFDMA transceiver design significantly less complex than conventional IFDMA transceivers. The new IFDMA transceiver design reduces the complexity by exploiting a certain correspondence between the IFDMA signal processing and the Cooley-Tukey IFFT/FFT algorithmic structure so that IFDMA streams can be inserted/extracted at different stages of an IFFT/FFT module according to the sizes of the streams. Although the prior work has laid down the theoretical foundation for the new IFDMA transceiver's structure, the practical realization of the transceiver on specific hardware with resource constraints has not been carefully investigated. This paper is an attempt to fill the gap. Specifically, this paper puts forth a heuristic algorithm called multi-priority scheduling (MPS) to schedule the execution of the butterfly computations in the IFDMA transceiver with the constraint of a limited number of hardware processors. The resulting FFT computation, referred to as MPS-FFT, has a much lower computation time than conventional FFT computation when applied to the IFDMA signal processing. Importantly, we derive a lower bound for the optimal IFDMA FFT computation time to benchmark MPS-FFT. Our experimental results indicate that when the number of hardware processors is a power of two: 1) MPS-FFT has near-optimal computation time; 2) MPS-FFT incurs less than 44.13% of the computation time of the conventional pipelined FFT.

“…Moreover, it provides a maximum performance of 128 GFLOPS for a single precision floating point calculation [4]. In addition, several research communities have developed high-performance computing systems using the C6678 DSP [3, 5-9].…”

Section: Introduction (mentioning)

confidence: 99%

Parallel implementation of pulse compression method on a multi-core digital signal processor

Klilou

2020

IJECE

The pulse compression algorithm is widely used in radar applications. It requires substantial processing power to execute in real time, so its processing must be distributed across multiple processing units. The present paper proposes a real-time platform based on the multi-core digital signal processor (DSP) C6678 from Texas Instruments (TI). The objective of this paper is to optimize the parallel implementation of the pulse compression algorithm over the eight cores of the C6678 DSP. Two parallelization approaches were implemented. The first approach is based on the Open Multi-Processing (OpenMP) programming interface, a software interface that helps execute different sections of a program on a multi-core processor. The second approach is an optimized method that we have proposed to distribute the processing and synchronize the eight cores of the C6678 DSP. The proposed method gives the best performance: a parallel efficiency of 94% was obtained when all eight cores were activated.
