2013
DOI: 10.1145/2457443.2457447
Floating-Point Exponentiation Units for Reconfigurable Computing

Abstract: The high performance and capacity of current FPGAs make them suitable as acceleration co-processors. This article studies the implementation, for such accelerators, of the floating-point power function x^y as defined by the C99 and IEEE 754-2008 standards, generalized here to arbitrary exponent and mantissa sizes. Last-bit accuracy at the smallest possible cost is obtained thanks to a careful study of the various subcomponents: a floating-point logarithm, a modified floating-point exponential, and a truncated…
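The abstract builds x^y from three subcomponents: a logarithm, a multiplication by y, and an exponential. As a minimal point of reference, the C sketch below renders that same decomposition in software, evaluating x^y as 2^(y*log2(x)) with libm and handling a small subset of the special cases C99/IEEE 754-2008 mandate. The function name and the selection of handled cases are illustrative assumptions, not the paper's hardware design.

#include <math.h>

/* Illustrative software rendering of the x^y = 2^(y*log2(x)) decomposition.
 * The paper's FPGA units implement the logarithm, the product and the
 * exponential with extra internal precision; this version just chains
 * working-precision libm calls. */
double pow_via_log_exp(double x, double y)
{
    /* A few of the special cases required by C99 Annex F (subset only). */
    if (y == 0.0)                 return 1.0;   /* pow(x, +-0) = 1, even for NaN x */
    if (x == 1.0)                 return 1.0;   /* pow(1, y) = 1, even for NaN y   */
    if (isnan(x) || isnan(y))     return NAN;
    if (x < 0.0 && y != trunc(y)) return NAN;   /* negative base, non-integer exponent */

    double r = exp2(y * log2(fabs(x)));         /* core path */
    if (x < 0.0 && fmod(y, 2.0) != 0.0)         /* odd integer exponent: restore sign */
        r = -r;
    return r;
}

Chaining a rounded log2, a rounded product and exp2 in working precision cannot guarantee last-bit accuracy, especially for large exponents; that accuracy problem is what the paper's careful sizing of the subcomponents addresses in hardware.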

Cited by 16 publications (4 citation statements)
References 36 publications
“…For comparison purposes the table also presents the resource requirement of a state-of-the-art natural logarithm implementation based on piecewise polynomial approximation (PA) available in Altera DSP Builder [6], and with an iterative implementation [7] available in the open source FloPoCo tool [8].…”
Section: Results
confidence: 99%
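The comparison quoted above contrasts a piecewise polynomial (PA) natural logarithm, as shipped in Altera DSP Builder, with FloPoCo's iterative operator. The sketch below illustrates only the generic table-plus-polynomial structure of a PA logarithm; the segment count, polynomial degree and Taylor-derived coefficients are assumptions for illustration, not the implementation of either tool (real designs use minimax coefficients evaluated in fixed point).

#include <math.h>

#define SEG 64   /* number of segments over the mantissa range [1,2) */

static double c0[SEG], c1[SEG], c2[SEG];
static int tables_ready = 0;

/* Fill the per-segment degree-2 coefficients from a Taylor expansion of
 * ln() at each segment midpoint (illustrative; PA hardware would store
 * precomputed minimax coefficients in a ROM). */
static void build_tables(void)
{
    for (int i = 0; i < SEG; i++) {
        double m = 1.0 + (i + 0.5) / SEG;   /* segment midpoint */
        c0[i] = log(m);
        c1[i] = 1.0 / m;
        c2[i] = -0.5 / (m * m);
    }
    tables_ready = 1;
}

/* ln(x) for finite x > 0: split off the exponent, look up the segment of
 * the mantissa, evaluate the polynomial with Horner's rule. */
double ln_piecewise(double x)
{
    const double ln2 = 0.6931471805599453;
    if (!tables_ready) build_tables();

    int e;
    double m = frexp(x, &e);                /* x = m * 2^e, m in [0.5,1) */
    m *= 2.0; e -= 1;                       /* renormalise m into [1,2)  */

    int i = (int)((m - 1.0) * SEG);
    if (i >= SEG) i = SEG - 1;
    double d = m - (1.0 + (i + 0.5) / SEG); /* offset from the midpoint  */

    return (c2[i] * d + c1[i]) * d + c0[i] + e * ln2;
}

With 64 segments and degree 2 this reaches roughly single-precision accuracy; a hardware PA operator would choose the segmentation and degree to meet its target format and error budget.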
“…The softmax layer at the end of the classifier and the cross-entropy loss require computation of exponential and logarithmic functions. Since softmax and loss contribute to a negligible portion of the total workload and their accurate calculation requires complicated hardware [9,21], we assign their computation to CPU. Section 4.4 describes the scheduling.…”
Section: Accelerator Design, 4.1 Overview
confidence: 99%
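The quote above explains that softmax and the cross-entropy loss are assigned to the CPU because they require exponentials and logarithms. For reference, the sketch below shows a numerically stable CPU-side version of that computation for one sample; the function name and interface are illustrative assumptions, not the cited accelerator's code.

#include <math.h>
#include <stddef.h>

/* Softmax over `logits` (length n) followed by cross-entropy against the
 * true class `target`.  Subtracting the maximum logit keeps exp() in range. */
double softmax_cross_entropy(const double *logits, double *probs,
                             size_t n, size_t target)
{
    double m = logits[0];
    for (size_t i = 1; i < n; i++)
        if (logits[i] > m) m = logits[i];   /* max logit for stability */

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        probs[i] = exp(logits[i] - m);      /* shifted exponentials */
        sum += probs[i];
    }
    for (size_t i = 0; i < n; i++)
        probs[i] /= sum;                    /* normalise to probabilities */

    return -log(probs[target]);             /* loss for the true class */
}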
“…The power operator could be implemented such as in [4] in the near future. This would allow us to obtain the latency of the two last benchmarks.…”
Section: Benchmarks That Exposed the Limitations of HLS Tools
confidence: 99%