uBench: exposing the impact of CUDA block geometry in terms of performance

Torres, Yuri; González-Escribano, Arturo; Llanos, Diego R.

doi:10.1007/s11227-013-0921-z

Cited by 23 publications

(36 citation statements)

References 1 publication


“…In the current prototype, the CPU threads granularity is determined by a simple regular blocking policy, that does not require a specific kernel characterization. For GPU kernels, the library integrates the model presented in [15,21]. This model allows the determination of configuration parameters (grid, threadblock, and L1 cache memory sizes), for NVIDIA's GPUs.…”

Section: Controllers Library (mentioning)

confidence: 99%

Multi-device Controllers: A Library to Simplify Parallel Heterogeneous Programming

Moreton-Fernandez

González-Escribano

Llanos

2017

Int J Parallel Prog

Self Cite


Current HPC clusters are composed of several machines with different computation capabilities and different kinds and families of accelerators. Programming efficiently for these heterogeneous systems has become an important challenge. There are many proposals to simplify the programming and management of accelerator devices, and hybrid programming that mixes accelerators and CPU cores. However, portability in many cases compromises efficiency on different devices, and there are details about the coordination of different types of devices that must still be handled by the programmer. In this work we introduce the Multi-Controller (MCtrl), an abstract entity implemented in a library, that coordinates the management of heterogeneous devices, including accelerators with different capabilities and sets of CPU cores. Our proposal improves on state-of-the-art solutions, simplifying the data partition, mapping, and transparent deployment of both simple generic kernels portable across different device types and specialized implementations defined and optimized using specific native or vendor programming models (such as CUDA for NVIDIA's GPUs, or OpenMP for CPU cores). The runtime system automatically selects and deploys the most appropriate implementation of each kernel for each device, managing the data movements and hiding the launching details. Results of an experimental study with five case studies indicate that our abstraction allows the development of flexible and highly efficient programs that adapt to the heterogeneous environment.


“…Kepler is an evolution of Fermi, with more resources and many new features, but fundamentally, what is described for Fermi optimization in Section 2.4.3.2.1 is also valid for Kepler, as supported by prior work [97], albeit with some nuances. Thus, the choice of block size and shape is one of the most important decisions the programmer must make when coding a parallel algorithm in CUDA.…”

Section: Optimization on Kepler (unclassified)

Consumo energético de métodos iterativos para sistemas dispersos en procesadores gráficos

Badenes


“…The programming guides provided by CUDA suggest using certain values to obtain good performance. However, some studies [15,16] have shown that, in some cases, these recommendations do not always yield optimal performance, forcing programmers to run trial-and-error tests to find the values that give the best performance.…”

Section: Examples Of a Graph With (unclassified)

“…This technique will allow us to solve the APSP problem through the throughput-based method known as n×SSSP, in which each SSSP is executed independently with a different source node. We will refine an existing kernel characterization model [16], taking into account not only the new concurrent kernel execution feature but also some of the characteristics of the input graphs.…”

Section: Proposal and Development (unclassified)

“…The CUDA programming guidelines suggest the use of threadblock sizes that maximize the occupancy, for obtaining a good performance. However, some studies [15,16] have shown that, in some cases, these values recommended by CUDA do not always lead to the optimum performance, leaving to the programmers the task of searching for the best values through time-consuming, trial-and-error tests.…”

Section: GPUs for Parallel Computing (mentioning)

confidence: 99%


Parallel approaches to shortest-path problems for multilevel heterogeneous computing

Arranz


Abstract parallel model, APSP, automatic kernel tuning, Boost Graph Library, L1 cache configuration, concurrent kernel execution, CUDA, Dijkstra, GPGPU, heterogeneous systems, HPC framework, kernel characterization model, kernel characterization process, load balancing, MPI, comparison of NVIDIA platforms, OpenMP, optimization techniques, parallel algorithms, SSSP, threadblock geometry.
