Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems 2016
DOI: 10.1145/2968455.2968521
Matrix multiplication beyond auto-tuning

Abstract: Graphics Processing Units (GPUs) are used as general-purpose parallel accelerators in a wide range of applications. They are found in most computing systems, and mobile devices are no exception. The recent availability of programming APIs such as OpenCL for mobile GPUs promises to open up new classes of applications on these devices. However, producing high-performance GPU code is extremely difficult: subtle differences in device characteristics can lead to large performance variations when different optimization…

Cited by 15 publications (1 citation statement); references 29 publications.
“…In addition, several parallel programming frameworks exist [11,17,23,29,38,44,45,47] that enable the compilation of domain-specific languages on GPUs. Lift [26,46] extends its existing data-parallel primitive types to accommodate loop tiling (e.g., slide, pad) and its low-level OpenCL with local memory allocation (e.g., toLocal) for stencil computations.…”
Section: GPU Features into Programming Languages
Confidence: 99%
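The citation statement above refers to loop tiling, the optimization that primitives such as slide and pad express in Lift and that tuned GPU matrix-multiplication kernels rely on. As a rough illustration only, the sketch below shows loop tiling in plain Python; the tile size is an assumed tuning parameter, and on a GPU the tiles would correspond to work-groups staging data in local memory (cf. toLocal) rather than to Python loops.

```python
def tiled_matmul(A, B, n, tile=2):
    """Multiply two n x n matrices (lists of lists) with loop tiling.

    `tile` is a hypothetical tuning parameter: it bounds the working set
    touched by the innermost loops, which is what makes tiling profitable
    on hardware with small fast memories.
    """
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):            # tile over rows of C
        for jj in range(0, n, tile):        # tile over columns of C
            for kk in range(0, n, tile):    # tile over the reduction dim
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        acc = C[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

Choosing `tile` is exactly the kind of device-specific decision the paper's abstract points to: the best value differs across GPUs, which is why auto-tuning (and approaches beyond it) matter.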