Due to their massive computational power, graphics processing units (GPUs) have become a popular platform for executing general-purpose parallel applications. GPU programming models allow the programmer to create thousands of threads, each executing the same computing kernel. GPUs exploit this parallelism in two ways: first, threads are grouped into fixed-size SIMD batches known as warps, and second, many such warps are executed concurrently on a single GPU core. Despite these techniques, the computational resources of GPU cores remain underutilized, resulting in performance far short of what could be delivered. Two reasons for this are conditional branch instructions and stalls due to long-latency operations. To improve GPU performance, computational resources must be utilized more effectively. To accomplish this, we propose two independent ideas: the large warp microarchitecture and two-level warp scheduling. We show that, when combined, our mechanisms improve performance by 19.1% over traditional GPU cores for a wide variety of general-purpose parallel applications that heretofore have not been able to fully exploit the available resources of the GPU chip.
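As a concrete illustration of the second idea, here is a minimal Python sketch of two-level round-robin warp scheduling, assuming the behavior the abstract describes: warps are partitioned into fetch groups, the scheduler issues round-robin within the active group, and it advances to the next group only when every warp in the current group is stalled on a long-latency operation. All class and method names are hypothetical, not from the paper.

```python
# Illustrative sketch of two-level round-robin warp scheduling.
# (Hypothetical names; a model of the idea, not the paper's RTL.)

class Warp:
    def __init__(self, wid):
        self.wid = wid
        self.stalled = False      # stalled on a long-latency operation?

    def ready(self):
        return not self.stalled


class TwoLevelScheduler:
    def __init__(self, warps, group_size):
        # Partition warps into fixed-size fetch groups.
        self.groups = [warps[i:i + group_size]
                       for i in range(0, len(warps), group_size)]
        self.active = 0    # index of the active fetch group
        self.rr = 0        # round-robin pointer within the active group

    def pick_warp(self):
        group = self.groups[self.active]
        # Round-robin among ready warps in the active fetch group.
        for i in range(len(group)):
            w = group[(self.rr + i) % len(group)]
            if w.ready():
                self.rr = (self.rr + i + 1) % len(group)
                return w
        # Every warp in the group is stalled: activate the next group,
        # staggering long-latency operations across groups.
        self.active = (self.active + 1) % len(self.groups)
        self.rr = 0
        return None        # nothing issued this cycle
```

Switching groups only when a whole group stalls staggers memory accesses across groups, so one group's long-latency operations overlap with another group's computation.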
Aggressive prefetching is very beneficial for memory latency tolerance in many applications. However, it faces significant challenges in multi-core systems: the prefetchers of different cores on a chip multiprocessor (CMP) can significantly interfere with the prefetch and demand accesses of other cores. Because existing prefetcher throttling techniques do not address this prefetcher-caused inter-core interference, aggressive prefetching in multi-core systems can lead to significant performance degradation and wasted bandwidth. To make prefetching effective in CMPs, this paper proposes a low-cost mechanism that controls prefetcher-caused inter-core interference by dynamically adjusting the aggressiveness of multiple cores' prefetchers in a coordinated fashion. Our solution consists of a hierarchy of prefetcher aggressiveness control structures that combine per-core (local) and prefetcher-caused inter-core (global) interference feedback to maximize the benefits of prefetching on each core while optimizing overall system performance. On an eight-core system, these structures improve system performance by 23% while reducing bus traffic by 17% compared to employing aggressive prefetching, and improve system performance by 14% compared to a state-of-the-art prefetcher aggressiveness control technique.
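The hierarchical feedback can be pictured as follows. This is a hedged sketch, not the paper's exact policy: each interval, a per-core (local) controller proposes an aggressiveness change from its own prefetch accuracy, and a global controller overrides it with a throttle-down when measured inter-core interference is high. All thresholds and metric names are assumptions.

```python
# Hedged sketch of hierarchical (local + global) prefetcher throttling.
# Thresholds and metrics are illustrative assumptions.

LEVELS = (0, 4)   # aggressiveness range: 0 = prefetching off, 4 = most aggressive

def local_decision(accuracy, acc_high=0.75, acc_low=0.40):
    """Per-core feedback: throttle up when prefetches are accurate,
    down when they are not."""
    if accuracy >= acc_high:
        return +1
    if accuracy <= acc_low:
        return -1
    return 0

def global_decision(interference, intf_high=0.5):
    """Global feedback: force a throttle-down when this core's prefetcher
    heavily interferes with other cores (e.g., polluting shared cache
    space or consuming shared bus bandwidth)."""
    return -1 if interference >= intf_high else 0

def update_aggressiveness(level, accuracy, interference):
    # Global feedback, when it fires, overrides the local proposal.
    delta = global_decision(interference) or local_decision(accuracy)
    return max(LEVELS[0], min(LEVELS[1], level + delta))
```

Coordinating this way lets a core with an accurate prefetcher stay aggressive while a core whose prefetches hurt others is throttled, which purely local throttling cannot do.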
Existing DRAM controllers employ rigid, non-adaptive scheduling and buffer management policies when servicing prefetch requests. Some controllers treat prefetch requests the same as demand requests, while others always prioritize demand requests over prefetches. However, neither rigid policy yields the best performance, because neither takes into account the usefulness of prefetch requests. If prefetch requests are useless, treating prefetches and demands equally can lead to significant performance loss and extra bandwidth consumption. In contrast, if prefetch requests are useful, prioritizing demands over prefetches can hurt performance by reducing DRAM throughput and delaying the service of useful requests. This paper proposes a new low-cost memory controller, called the Prefetch-Aware DRAM Controller (PADC), that aims to maximize the benefit of useful prefetches and minimize the harm caused by useless prefetches. To accomplish this, PADC estimates the usefulness of prefetch requests and dynamically adapts its scheduling and buffer management policies based on those estimates. The key idea is, based on the accuracy of the prefetcher, to 1) adaptively prioritize between demand and prefetch requests, and 2) drop useless prefetches to free up memory system resources. Our evaluation shows that PADC significantly outperforms previous memory controllers with rigid prefetch handling policies on both single- and multi-core systems with a variety of prefetching algorithms. Across a wide range of multiprogrammed SPEC CPU 2000/2006 workloads, it improves system performance by 8.2% on a 4-core system and by 9.9% on an 8-core system, while reducing DRAM bandwidth consumption by 10.7% and 9.4%, respectively.
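A minimal sketch of the PADC idea, under stated assumptions: the controller tracks each core's prefetch accuracy (useful prefetches divided by issued prefetches), treats prefetches from accurate prefetchers like demands, deprioritizes the rest, and drops stale prefetches from inaccurate prefetchers. The threshold, drop age, and field names below are illustrative, not the paper's exact parameters.

```python
# Hedged sketch of PADC-style adaptive prioritization and dropping.
# ACC_THRESHOLD and DROP_AGE are assumed values, not the published ones.

from dataclasses import dataclass

ACC_THRESHOLD = 0.6   # assumed per-core prefetch-accuracy cutoff
DROP_AGE = 500        # assumed age (cycles) before a low-accuracy
                      # prefetch is dropped to free request buffers

@dataclass
class Request:
    core: int
    is_prefetch: bool
    row_hit: bool      # would the access hit the currently open row?
    arrival: int       # cycle the request entered the queue

def priority(req, accuracy, now):
    """Higher tuple wins: treat-as-demand, then row-hit, then oldest."""
    as_demand = (not req.is_prefetch) or accuracy[req.core] >= ACC_THRESHOLD
    return (as_demand, req.row_hit, now - req.arrival)

def schedule(queue, accuracy, now):
    # Adaptive dropping: discard old prefetches from inaccurate prefetchers.
    queue[:] = [r for r in queue
                if not (r.is_prefetch
                        and accuracy[r.core] < ACC_THRESHOLD
                        and now - r.arrival > DROP_AGE)]
    # Adaptive prioritization between demand and prefetch requests.
    return max(queue, key=lambda r: priority(r, accuracy, now)) if queue else None
```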
A primary use of chip-multiprocessor (CMP) systems is to speed up a single application by exploiting thread-level parallelism. In such systems, threads may slow each other down by issuing memory requests that interfere in the shared memory subsystem. This inter-thread memory system interference can significantly degrade parallel application performance. Better memory request scheduling may mitigate such performance degradation. However, previously proposed memory scheduling algorithms for CMPs are designed for multiprogrammed workloads in which each core runs an independent application, and thus do not take into account the inter-dependent nature of threads in a parallel application. In this paper, we propose a memory scheduling algorithm designed specifically for parallel applications. Our approach has two main components, targeting two common synchronization primitives that cause inter-dependence of threads: locks and barriers. First, the runtime system estimates the set of limiter threads, i.e., the threads holding the locks that cause the most serialization, and the memory scheduler prioritizes them. Second, the memory scheduler shuffles thread priorities to reduce the time threads take to reach the barrier. We show that our memory scheduler speeds up a set of memory-intensive parallel applications by 12.6% compared to the best previous memory scheduling technique.
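The two components can be sketched together, assuming an interval-based design: limiter threads (identified by the runtime as holders of the most serializing locks) always get top memory priority, and the priorities of the remaining threads are shuffled each interval so no thread falls far behind at the next barrier. The interval length, class names, and ranking details are illustrative assumptions.

```python
# Hedged sketch of limiter-thread prioritization plus priority shuffling.
# Requests are assumed to carry a .tid field naming the issuing thread.

import random

SHUFFLE_INTERVAL = 800   # assumed shuffle period, in cycles

class ParallelAppMemScheduler:
    def __init__(self, thread_ids):
        self.limiters = set()          # filled in by the runtime system
        self.order = list(thread_ids)  # current non-limiter ranking
        self.last_shuffle = 0

    def set_limiters(self, tids):
        """Runtime estimate: threads holding the most-contended locks."""
        self.limiters = set(tids)

    def rank(self, tid):
        # Limiter threads always outrank non-limiters; lower sorts first.
        if tid in self.limiters:
            return (0, 0)
        return (1, self.order.index(tid))

    def pick(self, ready_requests, now):
        # Periodically shuffle non-limiter priorities to cut barrier wait.
        if now - self.last_shuffle >= SHUFFLE_INTERVAL:
            random.shuffle(self.order)
            self.last_shuffle = now
        if not ready_requests:
            return None
        # Service the request of the highest-ranked thread first.
        return min(ready_requests, key=lambda r: self.rank(r.tid))
```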
Chip multiprocessors (CMPs) share a large portion of the memory subsystem among multiple cores. Recent proposals have addressed high-performance and fair management of these shared resources; however, none of them take prefetch requests into account. Without prefetching, significant performance is lost, which is why existing systems prefetch. Yet by ignoring prefetch requests, recent shared-resource management proposals often significantly degrade both performance and fairness, rather than improve them, in the presence of prefetching. This paper is the first to propose mechanisms that manage the shared resources of a multi-core chip to obtain both high performance and fairness while also exploiting prefetching. We apply our proposed mechanisms to two resource-based management techniques for memory scheduling and one source-throttling-based management technique for the entire shared memory system. We show that our mechanisms improve the performance of a 4-core system that uses network fair queuing, parallelism-aware batch scheduling, and fairness via source throttling by 11.0%, 10.9%, and 11.3%, respectively, while also significantly improving fairness.
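One way to picture the general approach, offered as an assumption rather than the published policy: classify each core's prefetches by measured accuracy, let likely-useful prefetches join demands when the resource manager (here, a parallelism-aware batch-scheduling-style scheduler) allocates service, and demote likely-useless ones. The threshold and per-core cap are illustrative.

```python
# Hedged sketch: prefetch-aware batch formation for a PAR-BS-style
# memory scheduler. Requests are assumed to carry .core and .is_prefetch
# fields; ACC_THRESHOLD and cap_per_core are assumed values.

ACC_THRESHOLD = 0.6

def treat_as_demand(req, accuracy):
    # Prefetches from accurate prefetchers are handled like demands.
    return (not req.is_prefetch) or accuracy[req.core] >= ACC_THRESHOLD

def form_batch(queue, accuracy, cap_per_core=5):
    """Batch demands and accurate prefetches, up to cap_per_core requests
    per core; prefetches from inaccurate prefetchers wait outside the
    batch. The scheduler services a whole batch before forming the next."""
    batch, taken = [], {}
    for req in queue:
        if treat_as_demand(req, accuracy) and taken.get(req.core, 0) < cap_per_core:
            batch.append(req)
            taken[req.core] = taken.get(req.core, 0) + 1
    return batch
```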