Tarantula is an aggressive floating-point machine targeted at technical, scientific, and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor [6,5]. Tarantula adds to the EV8 core a vector unit capable of 32 double-precision flops per cycle. The vector unit fetches data directly from a 16 MByte second-level cache with a peak bandwidth of sixty-four 64-bit values per cycle. The whole chip is backed by a memory controller capable of delivering over 64 GBytes/s of raw bandwidth. Tarantula extends the Alpha ISA with new vector instructions that operate on new architectural state. Salient features of the architecture and implementation are: (1) it fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, (2) it provides high bandwidth for non-unit-stride memory accesses, (3) it supports gather/scatter instructions efficiently, (4) it fully integrates with the EV8 core through a narrow, streamlined interface rather than acting as a co-processor, (5) it can achieve a peak of 104 operations per cycle, and (6) it achieves excellent "real-computation" per transistor and per watt ratios. Our detailed simulations show that Tarantula achieves an average speedup of 5X over EV8, out of a peak flop-rate speedup of 8X. Furthermore, performance on gather/scatter-intensive benchmarks such as Radix Sort is also remarkable: a speedup of almost 3X over EV8 and 15 sustained operations per cycle. Several benchmarks exceed 20 operations per cycle.
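As a rough illustration of the access pattern behind point (3), the following plain C++ loop performs an indexed sparse AXPY; on a machine like Tarantula the indexed loads and stores would map onto vector gather/scatter instructions served from the second-level cache. The kernel and names are our own sketch, not code from the paper.

```cpp
#include <cstddef>
#include <vector>

// Sparse AXPY: y[idx[i]] += a * x[idx[i]].
// The indexed loads (gather) and indexed stores (scatter) are the memory
// access pattern that Tarantula's vector unit is designed to service.
void sparse_axpy(double a,
                 const std::vector<double>& x,
                 std::vector<double>& y,
                 const std::vector<std::size_t>& idx) {
    for (std::size_t i = 0; i < idx.size(); ++i) {
        y[idx[i]] += a * x[idx[i]];   // gather from x, scatter into y
    }
}
```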
The goal of this study is twofold: to analyze in detail the nature of conditional branch mispredictions in correlation-based branch predictors and, based on this analysis, to reduce the impact of branch mispredictions on processor performance by decreasing the branch resolution delay instead of improving the branch prediction accuracy. We classify the conditional branches with the highest number of mispredictions according to the nature of their branch condition's analytical expression. Based on these expressions, we can analyze and in many cases precisely explain the origin of mispredictions. Moreover, many such branches belong to small sets of blocks inside loops, and within such sets some of the branch expressions have regularity properties. We show how to exploit this regularity by anticipating the branch outcome, where anticipation is a combination of value prediction and normal dataflow execution. We investigate a hardware mechanism to implement the concept of branch outcome anticipation. This mechanism relies on the separate execution of the normal program flow and a branch flow, a subset of the program flow consisting of copies of the instructions needed to compute branch outcomes. The branch flow uses the regularity properties of branch condition expressions to get ahead of the normal program flow whenever possible. Currently, the mechanism can only target a subset of the conditional branches, but for these branches we experimentally show that the anticipation mechanism reduces the average branch misprediction latency by 60%.
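As a toy illustration of the regularity the abstract refers to, consider a branch whose condition depends only on the loop index: its outcomes can be computed several iterations ahead of the rest of the loop, which is the intuition behind running a separate "branch flow" ahead of the normal flow. The sketch below (plain C++, with a made-up lookahead of 8) is only an analogy, not the paper's hardware mechanism.

```cpp
#include <cstdio>
#include <deque>

// Hypothetical illustration: the condition (i % 16 == 0) has a regular
// analytical expression, so its outcomes can be "anticipated" several
// iterations ahead of the main computation, mimicking a branch flow that
// runs ahead of the normal program flow.
int main() {
    const int N = 64, LOOKAHEAD = 8;
    std::deque<bool> anticipated;

    // "Branch flow": precompute outcomes ahead of the normal flow.
    for (int j = 0; j < LOOKAHEAD; ++j) anticipated.push_back(j % 16 == 0);

    long long work = 0;
    for (int i = 0; i < N; ++i) {
        bool taken = anticipated.front();          // outcome already known
        anticipated.pop_front();
        int next = i + LOOKAHEAD;
        if (next < N) anticipated.push_back(next % 16 == 0);

        if (taken) work += 100;                    // "normal flow" consumes it
        else       work += 1;
    }
    std::printf("work = %lld\n", work);
    return 0;
}
```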
This paper presents a many-core visual computing architecture code-named Larrabee, a new software rendering pipeline, a many-core programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed-function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die second-level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely in software in Larrabee, rather than in fixed-function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee's potential for a broad range of parallel computation.
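A minimal sketch of the binning idea mentioned above (plain C++, assuming 64x64-pixel tiles and precomputed screen-space bounding boxes; this is our simplification, not Larrabee's actual renderer): each triangle is routed to the screen tiles it overlaps, so each bin can later be rasterized independently with mostly tile-local memory traffic and little lock contention.

```cpp
#include <algorithm>
#include <vector>

// Simplified binning front end: assign each triangle, by its screen-space
// bounding box, to every 64x64-pixel tile it overlaps. A back end can then
// process each tile's bin on a separate core, touching only that tile's
// portion of the framebuffer.
struct Tri { float minx, miny, maxx, maxy; /* plus vertex data */ };

constexpr int TILE = 64;

std::vector<std::vector<int>> bin_triangles(const std::vector<Tri>& tris,
                                            int width, int height) {
    int tx = (width  + TILE - 1) / TILE;
    int ty = (height + TILE - 1) / TILE;
    std::vector<std::vector<int>> bins(tx * ty);

    for (int i = 0; i < static_cast<int>(tris.size()); ++i) {
        const Tri& t = tris[i];
        // Clamp the covered tile range to the screen (simplified culling).
        int x0 = std::max(0, static_cast<int>(t.minx) / TILE);
        int y0 = std::max(0, static_cast<int>(t.miny) / TILE);
        int x1 = std::min(tx - 1, static_cast<int>(t.maxx) / TILE);
        int y1 = std::min(ty - 1, static_cast<int>(t.maxy) / TILE);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                bins[y * tx + x].push_back(i);   // triangle index into bin
    }
    return bins;
}
```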
The rapid pace of change in 3D game technology makes workload characterization necessary for every game generation. Compared with CPU characterization, far less quantitative information about games is available. This paper focuses on analyzing a set of modern 3D games at the API call level and at the microarchitectural level using the Attila simulator. In addition to common geometry metrics, and in order to understand trade-offs in modern GPUs, the microarchitecture-level metrics allow us to analyze key performance characteristics such as the balance between texture and ALU instructions in fragment programs, dynamic anisotropic ratios, and vertex, z-stencil, color, and texture cache performance.

I. INTRODUCTION

GPU design and 3D game technology evolve side by side in leaps and bounds. Game developers deploy computationally demanding 3D effects that use complex shader programs with multiple texture accesses, highly tuned to extract the maximum performance from existing and soon-to-be-released GPUs. On the other hand, GPU designers carefully tailor and evolve their designs to cater for the needs of the next-generation games expected to be available when the GPU launches. To this end, they put on the market high-end and mid-range graphics cards with substantial vertex/pixel processing enhancements over previous generations, expecting to achieve high frame rates in newly released games. As with any other microprocessor design, carefully understanding, characterizing, and modeling the workloads at which a given GPU is aimed is key to predicting and meeting its performance targets. There is extensive literature characterizing multiple CPU workloads [24][25][26]. Compared to the CPU world, though, there is a lack of published data on 3D workloads in general, and interactive games in particular. The reasons are manifold: GPUs still have a wide variety of fixed functions that are difficult to characterize and model (texture sampling, for example); GPUs are evolving very fast, with continuous changes in their programming model (from fixed geometry and fixed texture combiners to full-fledged vertex and fragment programs); games are also rapidly evolving to exploit these new programming models; and new functions are constantly added to the rendering pipeline as higher silicon densities make the on-die integration of these functions cost-effective (geometry shaders and tessellation, for example). All these reasons combine to produce an accelerated rate of change in the workloads and render even relatively recent studies rapidly obsolete. For example, [1][2] characterize the span processing workload in the rasterization stage. However, today all GPUs use the linear edge function rasterization algorithm [6], making span processing no longer relevant. As another example, [1] studies the geometry bandwidth requirements per frame, which have since decreased substantially thanks to the use of indexed modes and the storage of vertices in local GPU memory. The goal of this work is to analyze and characterize a set of recent OpenGL (OGL) and ...
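To make one of the balance metrics concrete, the sketch below computes a texture-to-ALU instruction ratio from per-shader instruction counts of the kind a GPU simulator such as Attila can report. The struct layout, names, and counts here are invented for illustration; they are not data from the paper.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical per-shader instruction counts, as a cycle-level GPU simulator
// might report them; the tex/ALU ratio is one of the fragment-program
// balance metrics examined in this kind of characterization.
struct ShaderStats {
    const char* name;
    long long   alu_instr;   // arithmetic instructions executed
    long long   tex_instr;   // texture fetch instructions executed
};

int main() {
    std::vector<ShaderStats> samples = {
        {"frame_skin",    12'000'000, 3'000'000},   // made-up counts
        {"frame_terrain", 20'000'000, 8'000'000},
    };
    for (const auto& s : samples) {
        double ratio = static_cast<double>(s.tex_instr) / s.alu_instr;
        std::printf("%-14s tex/ALU = %.2f\n", s.name, ratio);
    }
    return 0;
}
```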