2023
DOI: 10.1145/3570638
Optimization Techniques for GPU Programming

Abstract: In the past decade, Graphics Processing Units have played an important role in the field of high-performance computing and they still advance new fields such as IoT, autonomous vehicles, and exascale computing. It is therefore important to understand how to extract performance from these processors, something that is not trivial. This survey discusses various optimization techniques found in 450 papers published in the last 14 years. We analyze the optimizations from different perspectives which shows that the…

Cited by 29 publications (19 citation statements)
References 313 publications
“…Each call to a CUDA kernel creates a new Grid, which is composed of multiple Blocks. Each Block is composed of up to 1024 separate Threads (Hijma et al. 2023). As shown in Figure 6, the Grid controls the number of Blocks through three dimensions: gridDim.x…”
Section: Folding and Integrating
confidence: 99%
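
A minimal CUDA sketch of the grid/block/thread hierarchy this excerpt describes; the kernel name, sizes, and buffer names are illustrative assumptions, not taken from the cited papers.

#include <cstdio>

__global__ void fillKernel(float *out, int n) {
    // Global index derived from the block and thread coordinates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    // Each block holds up to 1024 threads; the grid supplies enough
    // blocks (gridDim.x of them) to cover all n elements.
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    fillKernel<<<grid, block>>>(d_out, n);   // each launch creates a new grid
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
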
“…Shared memory gives us much room for software optimization because of its programmability. For example, the access latency of global memory is roughly 100 times higher than that of shared memory on some GPU architectures [9]. Reasonable use of programmable shared memory can significantly improve the performance of a computation, and related techniques are used in this paper.…”
Section: GPU Memory Hierarchy
confidence: 99%
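
A short sketch of the shared-memory optimization the excerpt points to: each block stages its slice of the input in fast on-chip shared memory and performs a tree reduction there, so slow global memory is read only once per element. The kernel and buffer names are illustrative assumptions.

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];   // programmable on-chip memory
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // One read from global memory per element; all further
    // traffic stays in shared memory, whose latency is far lower.
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction performed entirely in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];   // one partial sum per block
}

The kernel assumes a launch with 256 threads per block to match the tile size; a second pass (or an atomic add) combines the per-block partial sums.
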
“…However, support and data for GPU hardware are lacking. In addition, the identification of residues in this algorithm does not take into account that SIMD instructions [9] could further improve performance. The uneven allocation of computational resources in the Integration step wastes many resources early in the algorithm and leaves computational resources scarce in its later stages.…”
Section: Introduction
confidence: 99%
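
As a hedged illustration of the SIMD-style instructions mentioned above, the sketch below uses CUDA's built-in float4 vector type so that each thread issues one 128-bit load and store instead of four scalar ones; the kernel name and parameters are assumptions for illustration.

__global__ void scaleVec4(const float4 *in, float4 *out, float a, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];                       // one vectorized 128-bit load
        v.x *= a; v.y *= a; v.z *= a; v.w *= a; // four elements per thread
        out[i] = v;                             // one vectorized 128-bit store
    }
}

Here n4 is the element count divided by four; the buffers must be 16-byte aligned, which cudaMalloc guarantees.
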
“…One of the main barriers to entry for other researchers may be the perceived difficulty of GPU programming. Writing efficient code in CUDA or OpenCL requires careful attention to memory access patterns, balancing resource usage when mapping parallel processes to the hardware, and managing the interaction between the CPU and GPU (Hijma et al. 2023). There have been efforts to make GPU-accelerated value iteration more accessible.…”
Section: Introduction
confidence: 99%
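
A minimal sketch of the CPU-GPU interaction the excerpt refers to: the host allocates device buffers, copies inputs over, launches a kernel, and copies the result back. The SAXPY kernel and sizes are illustrative assumptions, not the cited paper's code.

#include <cstdio>
#include <vector>

__global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 16;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));

    // Host-to-device transfers: a classic bottleneck to keep in mind.
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(2.0f, dx, dy, n);

    // The device-to-host copy synchronizes with the kernel for us.
    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);   // expect 4.0

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
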