2011
DOI: 10.1260/1748-3018.5.2.341
CUDA Memory Optimizations for Large Data-Structures in the Gravit Simulator

Abstract: Modern GPUs open a completely new field for optimizing embarrassingly parallel algorithms. Implementing an algorithm on a GPU confronts the programmer with a new set of challenges for program optimization: in particular, tuning the program for the GPU memory hierarchy, whose organization and performance implications are radically different from those of general-purpose CPUs, and optimizing programs at the instruction level for the GPU. In this paper we analyze different approaches for optimizing the memory usage and a…

Cited by 9 publications (14 citation statements)
References 13 publications (9 reference statements)
“…These models greatly improve on the premature convergence, or convergence to locally optimal solutions, seen in past GA methods that use a single large population. To take advantage of the multitasking features of multi-core CPUs [15], these models implement parallel computing on the CPU by assigning blocks that have not yet been processed to unused cores. The GPU differs greatly from the CPU in parallel computing.…”
Section: Parallel GA
confidence: 99%
“…Computers usually use long sequences of numbers and a seed for table lookup to generate random numbers. We use NVIDIA's cuRAND kernel API [14,15] and treat each thread ID as that thread's seed, so that each thread has its own independent random sequence. A global evaluation is performed after each thread evaluates its own chromosomes and saves the results into global memory.…”
Section: Parallel SIMD-based Algorithm
confidence: 99%
“…Using GPU streams, commands for memory transfers and kernel executions that belong to different streams can be overlapped. In our implementation, we store data in the structure-of-arrays (SoA) format [28] to maximize the use of coalesced memory transactions. In the SoA format, the individual attributes of each record are stored contiguously, so that component-wise memory access by threads is possible regardless of the record size.…”
Section: GPU Overhead
confidence: 99%
“…This results in as many uncoalesced reads from global memory as there are dimensions of data whenever a thread must access the elements of such a structure. Conversely, the SoA format guarantees that all reads from global memory are coalesced, regardless of the number of dimensions, since all threads of the same half-warp access consecutive single values in global memory [28]. An in-memory R-tree called Q-tree (‘Query-tree’) is used for managing the GPU buffer.…”
Section: GPU-based Range Query
confidence: 99%
“…extensively discussed in technical manuals for various many-core devices, e.g., the CPU [7], the GPU [14], or the Cell processor [5]. The major choices of AoS and SoA can be further refined into hybrid formats, e.g., arrays of structures of arrays [1] or structures of arrays of structures [16].…”
Section: Introduction
confidence: 99%