Proceedings of the Workshop on Memory Centric High Performance Computing 2018
DOI: 10.1145/3286475.3286482
Data Placement Optimization in GPU Memory Hierarchy using Predictive Modeling

Cited by 5 publications (4 citation statements)
References 5 publications
“…Here, n is the number of matrices and dim contains the dimensions of the matrices. For instance, for A(20×2) × B(2×30) × C(30×12) × D(12×8), the inputs are n = 4 and dim = [20, 2, 30, 12, 8].…”
Section: Serial Algorithm by Dynamic Programming Methods
confidence: 99%
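The quoted input convention can be made concrete with the textbook matrix-chain dynamic program: for n matrices, the dim array holds the n+1 boundary dimensions, so the example chain A(20×2) × B(2×30) × C(30×12) × D(12×8) corresponds to dim = [20, 2, 30, 12, 8]. A minimal sketch:

```python
def matrix_chain_cost(dim):
    """Minimum scalar multiplications to evaluate the chain, by dynamic programming."""
    n = len(dim) - 1                      # number of matrices
    # cost[i][j] = cheapest way to compute the product A_i ... A_j
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):        # chain length
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dim[i] * dim[k + 1] * dim[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]

print(matrix_chain_cost([20, 2, 30, 12, 8]))  # → 1232
```

The optimal order here groups around the narrow inner dimension 2, costing 1232 multiplications instead of the 1200 + 7200 + ... of a naive left-to-right evaluation.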
“…They showed that by understanding application I/O patterns and carefully designing data layouts, they increased read performance by more than 80%. [12] proposed a machine learning-based approach that builds a classifier to determine the class of GPU memory that minimizes GPU kernel execution time. This approach uses a set of performance counters obtained from profiling runs, along with hardware features, to generate the trained model.…”
Section: Related Work
confidence: 99%
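The classifier idea in the quote can be sketched as: represent each kernel by profiling-derived features and learn to pick the best memory class. This is a hedged toy illustration, not the cited work's model: the feature names, the labeling rule, and the nearest-centroid classifier are all stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
classes = ["global", "shared", "constant"]

# Hypothetical features per kernel: [data reuse ratio, coalescing rate, read-only fraction]
X = rng.random((200, 3))
# Toy labeling rule standing in for measured best-memory-class labels
y = np.where(X[:, 0] > 0.6, 1, np.where(X[:, 2] > 0.7, 2, 0))

# Nearest-centroid classifier standing in for the trained model
centroids = np.array([X[y == c].mean(axis=0) for c in range(3)])

def predict(features):
    """Return the memory class whose training centroid is closest."""
    d = np.linalg.norm(centroids - np.asarray(features), axis=1)
    return classes[int(np.argmin(d))]

print(predict([0.9, 0.5, 0.1]))  # high data reuse → "shared"
```

The point is the pipeline shape (counters in, memory-class label out), not the particular model; the cited approach could use any supervised classifier over the same kind of feature vectors.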
“…A larger data input means an instruction requires more streaming processors (SPs) to process the data, which explains why FP-ACO consumes more hardware resources. Research [28][29][30] shows that choosing an appropriate parallel model is vital to GPU performance. As shown in Figure 4, when a kernel function runs on the GPU, execution is a cooperation among the grids, the blocks, and the threads.…”
Section: Proposed Methods
confidence: 99%
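The grid/block/thread cooperation described above reduces, for each thread, to computing a global index from its coordinates. A minimal 1-D sketch in pure Python standing in for the CUDA built-ins:

```python
def global_thread_id(block_idx, block_dim, thread_idx):
    """CUDA equivalent: blockIdx.x * blockDim.x + threadIdx.x"""
    return block_idx * block_dim + thread_idx

# A grid of 4 blocks × 256 threads covers 1024 elements, one per thread
ids = [global_thread_id(b, 256, t) for b in range(4) for t in range(256)]
print(len(ids), ids[0], ids[-1])  # → 1024 0 1023
```

Choosing the grid and block dimensions so that this mapping covers the data without idle threads is exactly the "appropriate parallel model" decision the quote refers to.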
“…Similar to our strategy, the LIFT framework [6], [7] extracts low-level features from an intermediate representation (IR) and then uses a machine learning approach to predict performance from the extracted code features.…”
Section: Related Work
confidence: 99%
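The feature-based performance prediction the quote describes can be sketched as a regression from IR-derived counts to runtime. Everything here is an assumption for illustration: the feature names, the synthetic cost model, and the least-squares fit are stand-ins, not LIFT's actual features or model.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical features per kernel: [num_loads, num_stores, num_flops, num_branches]
X = rng.integers(1, 100, size=(50, 4)).astype(float)
# Synthetic "measured" runtimes: memory operations dominate in this toy cost model
t = 2.0 * X[:, 0] + 3.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0.0, 1.0, 50)

# Least-squares fit standing in for the learned performance model
coef, *_ = np.linalg.lstsq(X, t, rcond=None)
predicted = np.array([40.0, 10.0, 80.0, 5.0]) @ coef
print(round(predicted, 1))  # ≈ 2·40 + 3·10 + 0.5·80 = 150 under the toy model
```

The design point shared with the quoted approaches is that static code features, once extracted, let the model rank candidate code versions or placements without running each one.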