Analyzing program flow within a many-kernel OpenCL application

Mistry, Perhaad; Gregg, Chris; Rubin, Norman; Kaeli, David; Hazelwood, Kim

doi:10.1145/1964179.1964193

Cited by 26 publications

(18 citation statements)

References 14 publications


“…The benchmarks we used to evaluate our proposed VirtCL framework were collected from the AMD Accelerated Parallel Processing (APP) SDK [2] and the Rodinia 2.1 benchmark suite [8]. A real-world application (clsurf [27]) was also used to evaluate the effectiveness and scalability of the VirtCL framework. Figure 5 shows the normalized execution times and their contributing components (which were measured by instrumenting gettimeofday(), the RDTSC instruction [18], and clGetEventProfilingInfo()) for the Rodinia benchmark suite when using the native OpenCL library and the proposed VirtCL library on the aforementioned platform with only one GPU device.…”

Section: Evaluations and Discussion (mentioning)

confidence: 99%

VirtCL: a framework for OpenCL device abstraction and management

You, Tsai, et al. 2015

Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming


The interest in using multiple graphics processing units (GPUs) to accelerate applications has increased in recent years. However, the existing heterogeneous programming models (e.g., OpenCL) abstract details of GPU devices at the per-device level and require programmers to explicitly schedule their kernel tasks on a system equipped with multiple GPU devices. Unfortunately, multiple applications running on a multi-GPU system may compete for some of the GPU devices while leaving other GPU devices unused. Moreover, the distributed memory model defined in OpenCL, where each device has its own memory space, increases the complexity of managing the memory among multiple GPU devices. In this article we propose a framework (called VirtCL) that reduces the programming burden by acting as a layer between the programmer and the native OpenCL run-time system for abstracting multiple devices into a single virtual device and for scheduling computations and communications among the multiple devices. VirtCL comprises two main components: (1) a front-end library, which exposes primary OpenCL APIs and the virtual device, and (2) a back-end run-time system (called CLDaemon) for scheduling and dispatching kernel tasks based on a history-based scheduler. The front-end library forwards computation requests to the back-end CLDaemon, which then schedules and dispatches the requests. We also propose a history-based scheduler that is able to schedule kernel tasks in a contention- and communication-aware manner. Experiments demonstrated that the VirtCL framework introduced a small overhead (mean of 6%) but outperformed the native OpenCL run-time system for most benchmarks in the Rodinia benchmark suite, which was due to the abstraction layer eliminating the time-consuming initialization of OpenCL contexts. We also evaluated different scheduling policies in VirtCL with a real-world application (clsurf) and various synthetic workload traces. The results indicated that the VirtCL framework provides scalability for multiple kernel tasks running on multi-GPU systems.


“…Mistry et al [11] developed a profiling technique for analyzing data flow in multi-kernel OpenCL applications. This approach does not apply optimizations but allows programmers to identify bottlenecks by manually inspecting a profiling trace.…”

Section: Related Work (mentioning)

confidence: 99%

Helium: a transparent inter-kernel optimizer for OpenCL

Lutz, Fensch, Cole 2015

Proceedings of the 8th Workshop on General Purpose Processing Using GPUs


State-of-the-art automatic optimization of OpenCL applications focuses on improving the performance of individual compute kernels. Programmers address opportunities for inter-kernel optimization in specific applications by ad-hoc hand tuning: manually fusing kernels together. However, the complexity of interactions between host and kernel code makes this approach weak or even unviable for applications involving more than a small number of kernel invocations or a highly dynamic control flow, leaving substantial potential opportunities unexplored. It also leads to an overly complex, hard-to-maintain code base. We present Helium, a transparent OpenCL overlay which discovers, manipulates and exploits opportunities for inter- and intra-kernel optimization. Helium is implemented as a preloaded library and uses a delay-optimize-replay mechanism in which kernel calls are intercepted, collectively optimized, and then executed according to an improved execution plan. This allows us to benefit from composite optimizations, on large, dynamically complex applications, with no impact on the code base. Our results show that Helium obtains at least the same, and frequently even better, performance than carefully hand-tuned code. Helium outperforms hand-optimized code where the exact dynamic composition of compute kernels cannot be known statically. In these cases, we demonstrate speedups of up to 3x over unoptimized code and an average speedup of 1.4x over hand-optimized code.


“…An ipoint contains four different information pieces [16]. The first part contains the location of the point in the image.…”

Section: Related Work (mentioning)

confidence: 99%

“…Surf Implementations 1) CLSurf: [17] is an OpenCL implementation of the SURF algorithm, developed by the NUCAR group (Northeastern University Computer Architecture Research Group) and AMD. They explain the implementation and they make a performance analysis in [16]. We choose this code to perform the tests based on the high quality of the implementation.…”

Section: Benchmark Setup (mentioning)

confidence: 99%

Speeded-up robust features (SURF) as a benchmark for heterogeneous computers

Iparraguirre, Balmaceda, Mariani 2014

2014 IEEE Biennial Congress of Argentina (ARGENCON)


Measuring the performance of heterogeneous computers is not a trivial task. Because of the implicit diversity of the architectures, it is hard to define a clear performance metric. We used SURF as a benchmark for heterogeneous computers. We measure results on multiple computers using OpenCL and CUDA implementations. Results show consistent measurements among platforms, implementations, and dataset size. Finally, we conclude that the relevant factor for this kind of application is the number of GPU cores.

