“…The accelerator discussed in this paper was developed using the CUDA development tools [27] and targets the NVIDIA GeForce GTX 1070 GPU [28], which features 1920 CUDA cores, 120 texture mapping units (TMUs), 15 streaming multiprocessors (SMs), 1.5 MB of shared memory, 4 MB of local memory, and 8 GB of GDDR5 memory. We compare the performance of the GPU accelerator with that of an FPGA implementation [19], which was developed using the Vitis Unified Software Platform [29] for the AMD Alveo U280 [30]. The FPGA used in [19] is built on the same 16 nm technology node as the GPU and contains 9024 digital signal processing (DSP) blocks, 41 MB of on-chip static RAM, 1,303,680 look-up tables (LUTs), and 8 GB of high-bandwidth memory (HBM2).…”
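
As a quick sanity check on the GPU figures quoted above, the per-SM resources can be derived from the stated totals. The sketch below is illustrative only: the `GpuSpec` class and `per_sm` helper are hypothetical names introduced here and are not part of the paper, and the only inputs are the core, TMU, and SM counts given in the excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    """Totals as quoted in the excerpt (hypothetical container, not from the paper)."""
    cuda_cores: int
    tmus: int
    sms: int


def per_sm(spec: GpuSpec) -> dict:
    """Divide the device-wide totals evenly across the SMs.

    Raises ValueError if a total does not divide evenly, which would
    suggest that one of the quoted figures is inconsistent.
    """
    result = {}
    for name, total in (("cuda_cores_per_sm", spec.cuda_cores),
                        ("tmus_per_sm", spec.tmus)):
        quotient, remainder = divmod(total, spec.sms)
        if remainder != 0:
            raise ValueError(f"{name}: {total} is not divisible by {spec.sms} SMs")
        result[name] = quotient
    return result


# Figures for the GeForce GTX 1070 as stated in the excerpt.
GTX_1070 = GpuSpec(cuda_cores=1920, tmus=120, sms=15)

if __name__ == "__main__":
    print(per_sm(GTX_1070))
```

With the quoted totals, each SM works out to 128 CUDA cores and 8 TMUs, so the three counts in the excerpt are mutually consistent.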