2020
DOI: 10.1007/978-3-030-63618-0_10

Can We Avoid Rounding-Error Estimation in HPC Codes and Still Get Trustworthy Results?

Abstract: Numerical validation enables one to ensure the reliability of numerical computations that rely on floating-point operations. Discrete Stochastic Arithmetic (DSA) makes it possible to validate the accuracy of floating-point computations using random rounding. However, it may bring a large performance overhead compared with standard floating-point operations. In this article, we show that with perturbed data it is possible to use standard floating-point arithmetic instead of DSA for the purpose of numerical validation…
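
As a rough illustration of the idea in the abstract, the following is a minimal sketch (not the authors' implementation; the computation, perturbation scheme, and digit estimate are assumptions for illustration): instead of applying random rounding inside every operation as DSA does, the same computation is re-run on randomly perturbed copies of the input with standard floating-point arithmetic, and the spread of the results is used to estimate the number of common significant digits.

```c
/* Minimal sketch (assumed, not the authors' code): estimate the accuracy of a
 * floating-point result by re-running the same computation on randomly
 * perturbed copies of the input, instead of using random rounding inside
 * every operation as DSA/CADNA does. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Example computation: naive summation of n values. */
static double compute(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += x[i];
    return s;
}

/* Perturb each input by one unit in the last place, with random sign. */
static void perturb(const double *x, double *y, int n) {
    for (int i = 0; i < n; ++i) {
        double ulp = nextafter(x[i], INFINITY) - x[i];
        y[i] = x[i] + ((rand() & 1) ? ulp : -ulp);
    }
}

int main(void) {
    enum { N = 100000, SAMPLES = 3 };
    static double x[N], y[N];
    for (int i = 0; i < N; ++i) x[i] = 0.1 * (i % 7);

    double r[SAMPLES];
    for (int k = 0; k < SAMPLES; ++k) {
        perturb(x, y, N);
        r[k] = compute(y, N);
    }
    /* Estimate the number of common significant digits from the spread of
     * the perturbed runs, in the spirit of DSA's estimator. */
    double mean = (r[0] + r[1] + r[2]) / 3.0;
    double spread = fmax(fabs(r[0] - mean), fmax(fabs(r[1] - mean), fabs(r[2] - mean)));
    double digits = (spread == 0.0) ? 15.0 : log10(fabs(mean) / spread);
    printf("result ~ %.17g, ~%.1f common significant digits\n", mean, digits);
    return 0;
}
```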

Cited by 3 publications (4 citation statements) | References 15 publications
“…In the Update routine, the algorithm needs to calculate the sum of the data instances in each cluster and then divide that sum by the number of instances in the cluster. Therefore, when a large number of instances are naively added together one by one, the rounding errors that accumulate can ultimately impair the clustering quality (see Reference 24 for more illustration of the effect of rounding errors). On the other hand, using double precision (64-bit arithmetic) can reduce the effect of rounding errors to a satisfying level of accuracy in our use case, but the computational cost is higher (see, e.g., Reference 25).…”
Section: Optimizing Parallel K-means Algorithm
confidence: 99%
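
The summation issue described in this quotation can be illustrated with a short sketch (an assumed example, not code from the cited paper): a naive single-precision accumulation loses accuracy when many values are added one by one, while a double-precision accumulator or Kahan compensated summation keeps the result accurate at some extra cost.

```c
/* Minimal sketch (assumed): rounding-error build-up in naive single-precision
 * summation, and two common ways to limit it: a 64-bit accumulator or
 * Kahan compensated summation. */
#include <stdio.h>

static float naive_sum(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += x[i];   /* rounding errors accumulate */
    return s;
}

static float double_acc_sum(const float *x, int n) {
    double s = 0.0;                           /* 64-bit accumulator */
    for (int i = 0; i < n; ++i) s += x[i];
    return (float)s;
}

static float kahan_sum(const float *x, int n) {
    float s = 0.0f, c = 0.0f;                 /* c carries the lost low-order bits */
    for (int i = 0; i < n; ++i) {
        float y = x[i] - c;
        float t = s + y;
        c = (t - s) - y;
        s = t;
    }
    return s;
}

int main(void) {
    enum { N = 10000000 };
    static float x[N];
    for (int i = 0; i < N; ++i) x[i] = 0.1f;
    printf("naive:  %f\n", naive_sum(x, N));       /* visibly off */
    printf("double: %f\n", double_acc_sum(x, N));  /* close to 1e6 */
    printf("kahan:  %f\n", kahan_sum(x, N));       /* close to 1e6 */
    return 0;
}
```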
“…Then the nearest centroid for each instance can be found and recorded (lines 16–18). Finally, the cluster label of each instance is updated according to its nearest centroid, and the label changes are counted in the private track of each thread (lines 22–25). The reduction directive sums the private tracks of all threads (line 3).…”
Section: ComputeAssign Routine
confidence: 99%
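
A minimal sketch of the routine this quotation describes, assuming a C/OpenMP setting (the function name, data layout, and variable names are illustrative, not the cited paper's code): each thread assigns its instances to the nearest centroid and counts label changes in a private counter that the reduction clause combines at the end of the loop.

```c
/* Minimal sketch (assumed): a ComputeAssign-style step for parallel k-means.
 * Each thread finds the nearest centroid for its instances, updates the
 * label, and counts label changes; reduction(+:changed) sums the per-thread
 * counters. */
#include <float.h>
#include <omp.h>

/* n instances of dimension d, k centroids; returns the number of label changes. */
long compute_assign(const float *data, const float *centroids,
                    int *label, long n, int d, int k) {
    long changed = 0;
    #pragma omp parallel for reduction(+:changed)
    for (long i = 0; i < n; ++i) {
        int best = 0;
        float best_dist = FLT_MAX;
        for (int c = 0; c < k; ++c) {
            float dist = 0.0f;   /* squared Euclidean distance */
            for (int j = 0; j < d; ++j) {
                float diff = data[i * d + j] - centroids[c * d + j];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best = c; }
        }
        if (label[i] != best) { label[i] = best; ++changed; }
    }
    return changed;
}
```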
“…Generally, FMA is used to perform matrix multiplication, which is widely used in Convolutional Neural Networks (CNN) [19] and Basic Linear Algebra Subprograms (BLAS) [21,22]. Hardware implementations of matrix multiplication face problems with large hardware resource requirements [19,20], software interference [21], memory requirements, and numerical inefficacy [23,24]. Conventional FMA architectures require large hardware shifters and CSAs.…”
Section: Introduction
confidence: 99%
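
For illustration, here is a minimal sketch (assumed, not from the cited work) of the fused multiply-add pattern in a matrix-multiplication inner loop, using the standard C fma() function, which computes a*b+c with a single rounding, as hardware FMA units do.

```c
/* Minimal sketch (assumed): matrix multiplication whose inner loop uses
 * fused multiply-add, so each a*b+acc step is rounded only once. */
#include <math.h>

/* C = A * B for row-major n x n matrices. */
void matmul_fma(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int p = 0; p < n; ++p)
                acc = fma(A[i * n + p], B[p * n + j], acc); /* one rounding per step */
            C[i * n + j] = acc;
        }
}
```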