Due to their massive computational power, graphics processing units (GPUs) have become a popular platform for executing general-purpose parallel applications. GPU programming models allow the programmer to create thousands of threads, each executing the same computing kernel. GPUs exploit this parallelism in two ways: first, threads are grouped into fixed-size SIMD batches known as warps; second, many such warps are executed concurrently on a single GPU core. Despite these techniques, the computational resources on GPU cores remain underutilized, resulting in performance far short of what could be delivered. Two reasons for this are conditional branch instructions and stalls due to long-latency operations. To improve GPU performance, computational resources must be more effectively utilized. To this end, we propose two independent ideas: the large warp microarchitecture and two-level warp scheduling. We show that, when combined, our mechanisms improve performance by 19.1% over traditional GPU cores for a wide variety of general-purpose parallel applications that heretofore have not been able to fully exploit the available resources of the GPU chip.
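As a rough illustration of the scheduling idea described above (the parameter values, function names, and policy details here are our own sketch, not the paper's microarchitecture), threads can be grouped into warps and warps partitioned into fetch groups, with the scheduler preferring warps from the currently active group so that different groups tend to reach long-latency operations at different times:

```python
WARP_SIZE = 32          # threads per SIMD warp (a typical value)
FETCH_GROUP_SIZE = 8    # warps per fetch group (hypothetical parameter)

def warp_id(thread_id):
    """Warp that a given thread belongs to."""
    return thread_id // WARP_SIZE

def fetch_group(warp):
    """Fetch group that a given warp belongs to (two-level scheduling)."""
    return warp // FETCH_GROUP_SIZE

def next_warp(warps_ready, active_group):
    """Pick a ready warp to fetch from, preferring the active fetch group.

    warps_ready: list of booleans, one per warp (True = not stalled).
    Returns a warp index, or None if every warp is stalled.
    """
    n_groups = (len(warps_ready) + FETCH_GROUP_SIZE - 1) // FETCH_GROUP_SIZE
    for g in range(n_groups):
        group = (active_group + g) % n_groups
        lo = group * FETCH_GROUP_SIZE
        hi = min(lo + FETCH_GROUP_SIZE, len(warps_ready))
        for w in range(lo, hi):
            if warps_ready[w]:
                return w
    return None  # all warps stalled on long-latency operations
```

The point of the second scheduling level is visible in `next_warp`: warps outside the active fetch group are only considered once every warp inside it is stalled.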
HPS (High Performance Substrate) is a new microarchitecture targeted for implementing very high performance computing engines. Our model of execution is a restriction on fine granularity data flow. This paper introduces the model, provides the rationale for its selection, and describes the data path and flow of instructions through the microengine.
Recent studies have concluded that little parallelism …

© 1991 ACM 0-89791-394-9/91/0005/0276 $1.50

2 The RDF Model of Execution

To exploit whatever parallelism exists in the instruction stream, one needs an execution model devoid of artifacts that limit the utilization of that parallelism. The abstract restricted data flow (RDF) paradigm is such a model. It is characterized by three parameters: window size, issue rate, and instruction class latencies.
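To make the three RDF parameters concrete, here is a toy cycle-by-cycle sketch (our own illustration, not the paper's model or simulator): each cycle, up to `issue_rate` instructions may issue out of order from a window of the oldest `window_size` unissued instructions, and an instruction issues only once the instructions it depends on have completed, its own result becoming available after its class latency.

```python
def rdf_cycles(program, window_size, issue_rate, latency):
    """Toy model of execution time under the three RDF parameters.

    program: list of (op_class, [indices of earlier instructions it depends on]),
             in program order.
    latency: dict mapping op_class -> cycles to complete after issue.
    Returns the cycle on which the last instruction completes.
    """
    n = len(program)
    done_at = [None] * n      # completion cycle of each instruction
    issued = [False] * n
    cycle = 0
    while not all(issued):
        cycle += 1
        # the window holds the oldest `window_size` unissued instructions
        window = [i for i in range(n) if not issued[i]][:window_size]
        slots = issue_rate
        for i in window:
            if slots == 0:
                break
            op_class, deps = program[i]
            # issue only if every dependency has completed by this cycle
            if all(done_at[d] is not None and done_at[d] <= cycle for d in deps):
                issued[i] = True
                done_at[i] = cycle + latency[op_class]
                slots -= 1
    return max(done_at)
```

Shrinking the window, lowering the issue rate, or raising a class latency each lengthens the computed execution time, which is how the three parameters bound exploitable parallelism in this toy model.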
HPS is a new model for a high performance microarchitecture which is targeted for implementing very dissimilar ISP architectures. It derives its performance from executing the operations within a restricted window of a program out of order, asynchronously, and concurrently whenever possible. Before the model can be reduced to an effective working implementation of a particular target architecture, several issues need to be resolved. This paper discusses these issues, both in general and in the context of architectures with specific characteristics.
Introduction.
Overview.

The Aquarius project [1] has, as the fundamental goal of its research, to establish the principles by which very large improvements in performance can be achieved in machines specialized for calculating difficult problems in design automation, expert systems, and signal processing. These problems are characterized by having substantial numeric and symbolic components. We are committed to the eventual design of a very high performance heterogeneous MIMD multiprocessor tailored to the execution of both numeric and logic calculations. Aquarius began in 1983. By 1985 we had completed and demonstrated the Aquarius I system [15], a small heterogeneous multiprocessor. Aquarius I achieved about an order of magnitude higher performance than had been achieved up to that time; for example, the Japanese Fifth Generation Computer 'PSI' had achieved 30 KLIPS in 1985. We are currently focusing on an experimental multiprocessor architecture (Aquarius II) for the high performance execution of Prolog that will contain 12 processors specialized for Prolog, plus other processors, for a total of 16.
Research Methodology.

It is worth stating at the outset a number of key concepts which reflect our fundamental methodology for doing research in high performance knowledge processing systems. First, we believe in a research environment where systems evolve, taking advantage of contributions from a number of sources, both within and outside Berkeley. Second, we believe that issues should be dealt with as quickly and inexpensively as possible: by gedanken experiments if possible, else by analysis, else by simulation, else by emulation, and finally, only if required, by constructing and analyzing machines. Third, the nature of high performance execution demands the effective utilization of enormous amounts of memory, coupled both loosely and tightly; it involves exploiting parallelism at both coarse and fine granularities; and it necessitates modularization of the system architecture to accommodate improvements in any element of the structure. Fourth, we are interested in proving concepts rather than engineering manufactured parts. Thus, we are interested in building experimental architectures which can then be transferred to sites more appropriate than ours for fabrication, to achieve higher performance and more reliable systems. We are interested in using as many standard components and buses as possible in the experimental machine; this will facilitate the rapid transfer of the architecture technology. Fifth, we believe in working closely with government and industry.