Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2015
DOI: 10.1145/2807591.2807606
Adaptive and transparent cache bypassing for GPUs

Cited by 62 publications (29 citation statements)
References 28 publications
“…To summarize, we argue that GPUs are the promising platform for the ALS workload when taking both performance and power consumption into account. In the future, we will further investigate the performance gap between platforms and push the factorizing performance to the hardware limit (in particular on newer Intel Xeon Phi processors with on-package high bandwidth memory [35,36], newer GPUs on warp-level [37,38], CTA-level [39] and cache-level [40], and other emergent accelerators such as Matrix-2000 [41]).…”
Section: Applying Optimizations
confidence: 99%
“…In the worst case, due to the lack of L2 cache capacity, it is sometimes necessary to load the evicted data from the off-chip memory. 6,31,33–41 Shared memory is an alternative to the L1 cache for storing preloaded data. There are several reasons to support this.…”
Section: Preloading In the Shared Memory
confidence: 99%
“…As many previous research studies have shown, effectively hiding cache resource contention is a crucial step to achieving high performance on GPUs. 6,31,33–41,43 Previous studies of resolving the resource contention problems are based on dynamic analysis methods that require hardware modification. In addition to preloading in shared memory efficiently, it is necessary to combine static analysis to avoid the L1 cache from the resource contentions effectively.…”
Section: Impact Of Various Preload Factors
confidence: 99%
“…If the attacker detects the protection and then changes to use L1 data cache, Tangram will eliminate the covert channel formed through L1 data cache using cache bypassing. Previous studies show that the GPU L1-D cache miss rate is so high so that the performance is not harmed when the GPU L1-D cache is bypassed [11,23,34,35,61,68]. Therefore, Tangram selectively bypasses the L1-D cache requests if the attacks are detected on the L1D cache instead.…”
Section: Tangram: Attack Mitigation
confidence: 99%
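The excerpts above repeatedly invoke the same observation: when the GPU L1-D miss rate is very high (e.g. under streaming access with no reuse), bypassing the L1-D costs little. A minimal Python simulation of that decision rule is sketched below; the cache model, class names, and threshold are illustrative assumptions, not taken from the cited papers or from the hardware.

```python
class L1DCache:
    """Tiny direct-mapped cache model that tracks hits and misses.

    Hypothetical model for illustration only; real GPU L1-D caches are
    set-associative with sector lines and per-warp coalescing.
    """

    def __init__(self, num_lines=32, line_bytes=128):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines  # tag stored per line slot
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Record one load: hit if the line's tag matches, else fill it."""
        line = addr // self.line_bytes
        idx = line % self.num_lines
        if self.tags[idx] == line:
            self.hits += 1
        else:
            self.tags[idx] = line
            self.misses += 1

    def miss_rate(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


def should_bypass(cache, threshold=0.9):
    """Bypass policy: route loads around L1-D once the observed miss
    rate exceeds the threshold (threshold value is an assumption)."""
    return cache.miss_rate() > threshold


# A pure streaming pattern (every access touches a new line) drives the
# miss rate to 1.0, so the policy recommends bypassing.
cache = L1DCache()
for addr in range(0, 128 * 1024, 128):
    cache.access(addr)
print(should_bypass(cache))  # prints True
```

With a reuse-heavy pattern (repeated accesses to the same address) the miss rate stays near zero and the same policy keeps the cache in the path, which matches the intuition in the excerpts: bypassing only pays off when the cache is not serving hits anyway.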