Fermi GF100 GPU Architecture

Wittenbrink, Craig M.; Kilgariff, Emmett; Prabhu, A.

doi:10.1109/mm.2011.24

Cited by 157 publications

(92 citation statements)

References 1 publication


“…Hardware task schedulers such as Carbon [41] lower overheads further for specific problem domains. GPUs [76] and Anton 2 [27] feature custom schedulers for non-speculative tasks. By contrast, Swarm implements speculative hardware task management for a different problem domain, ordered parallelism.…”

Section: Additional Related Work (mentioning)

confidence: 99%

A scalable architecture for ordered parallelism

Jeffrey, Subramanian, Yan, et al. 2015

Proceedings of the 48th International Symposium on Microarchitecture


We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover ordered parallelism. Swarm builds on prior TLS and HTM schemes, and contributes several new techniques that allow it to scale to large core counts and speculation windows, including a new execution model, speculation-aware hardware task management, selective aborts, and scalable ordered commits. We evaluate Swarm on graph analytics, simulation, and database benchmarks. At 64 cores, Swarm achieves 51-122× speedups over a single-core system, and outperforms software-only parallel algorithms by 3-18×.
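The ordering contract described in this abstract can be illustrated with a minimal sequential sketch. It models only the result a speculative, out-of-order implementation would have to preserve (tasks appear to run in timestamp order); it does not model speculation, aborts, or commits. The function name `run_in_timestamp_order`, the task tuple layout, and the ability of a task to return child tasks are assumptions made for illustration, not the paper's API.

```python
import heapq
from itertools import count


def run_in_timestamp_order(initial_tasks):
    """Run (timestamp, fn, arg) tasks in timestamp order.

    Hypothetical interface: fn(timestamp, arg) may return an iterable of
    new (timestamp, fn, arg) tasks, or None.
    """
    seq = count()  # tie-breaker so the heap never compares functions
    queue = [(ts, next(seq), fn, arg) for ts, fn, arg in initial_tasks]
    heapq.heapify(queue)
    order = []
    while queue:
        ts, _, fn, arg = heapq.heappop(queue)  # earliest timestamp first
        order.append((ts, arg))
        for child_ts, child_fn, child_arg in fn(ts, arg) or ():
            heapq.heappush(queue, (child_ts, next(seq), child_fn, child_arg))
    return order


def leaf(ts, arg):
    """A task that does nothing and creates no children."""
    return None


def spawn(ts, arg):
    """A task that creates one child with a later timestamp."""
    return [(ts + 1, leaf, arg + "-child")]


order = run_in_timestamp_order([(5, leaf, "c"), (1, spawn, "a"), (3, leaf, "b")])
print(order)
```

Any parallel execution of the same tasks would need to produce effects equivalent to this order.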


“…The NYU Ultracomputer [29] proposed implementing atomic fetch-and-add using adders in network switches, which could coalesce multiple requests on their way to memory. The Cray T3D [34], T3E [57], and SGI Origin [42] implemented RMOs at the memory controllers, while TilePro64 [30] and recent GPUs [63] implement RMOs in shared caches. Prior work has also proposed adding caches to memory controllers to accelerate RMOs [68] and data-parallel RMOs [5].…”

Section: Hardware Techniques (mentioning)

confidence: 99%

Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems

Zhang, Horn, Sánchez 2015

Proceedings of the 48th International Symposium on Microarchitecture


We present Coup, a technique to lower the cost of updates to shared data in cache-coherent systems. Coup exploits the insight that many update operations, such as additions and bitwise logical operations, are commutative: they produce the same final result regardless of the order they are performed in. Coup allows multiple private caches to simultaneously hold update-only permission to the same cache line. Caches with update-only permission can locally buffer and coalesce updates to the line, but cannot satisfy read requests. Upon a read request, Coup reduces the partial updates buffered in private caches to produce the final value. Coup integrates seamlessly into existing coherence protocols, requires inexpensive hardware, and does not affect the memory consistency model. We apply Coup to speed up single-word updates to shared data. On a simulated 128-core, 8-socket system, Coup accelerates state-of-the-art implementations of update-heavy algorithms by up to 2.4×.
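The buffer-then-reduce idea in this abstract can be sketched as a toy software model for a single shared word under commutative addition. This is an illustration under stated assumptions, not the hardware protocol: the class name `CoupLine`, its methods, and the per-cache dictionary are hypothetical, and coherence states, evictions, and other operation types are left out.

```python
from collections import defaultdict


class CoupLine:
    """Toy model of one shared word updated with commutative additions."""

    def __init__(self, initial=0):
        self.value = initial             # value held at the shared level
        self.partial = defaultdict(int)  # per-cache buffered deltas

    def update(self, cache_id, delta):
        # Update-only permission: coalesce the delta locally,
        # without communicating with other caches.
        self.partial[cache_id] += delta

    def read(self):
        # A read triggers a reduction of all buffered partial updates.
        self.value += sum(self.partial.values())
        self.partial.clear()
        return self.value


line = CoupLine(initial=10)
line.update(cache_id=0, delta=3)
line.update(cache_id=1, delta=4)
line.update(cache_id=0, delta=-2)  # coalesced with cache 0's earlier delta
result = line.read()
print(result)
```

Because addition is commutative, the order in which the caches' partial sums are combined does not change the value returned by the read.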


“…OpenCL is a standard parallel programming language for heterogeneous platforms [16]. It is initially designed for GPGPU architectures [4,6]. And it can also be mapped to general purpose CPUs efficiently [23].…”

Section: Related Work (mentioning)

confidence: 99%

“…The Open Computing Language (OpenCL) is a standard language for programming heterogeneous parallel platforms. Initially it is designed for general purpose computing on Graphic Processing Units (GPUs) [4,6], some of which are also wide SIMD processors. Therefore it is also suitable for programming low-energy SIMD processors.…”

Section: Introduction (mentioning)

confidence: 99%

A Co-Design Framework with OpenCL Support for Low-Energy Wide SIMD Processor

She, Waeijen, et al. 2014

J Sign Process Syst


Energy efficiency is one of the most important metrics in embedded processor design. The use of a wide SIMD architecture is a promising approach to building energy-efficient, high-performance embedded processors. In this paper, we propose a design framework for a configurable wide SIMD architecture that utilizes an explicit datapath to achieve high energy efficiency. The framework is able to generate processor instances based on architecture specification files. It includes a compiler to efficiently program the proposed architecture with standard programming languages including OpenCL. This compiler can analyze the static memory access patterns in OpenCL kernels, generate efficient mappings, and schedule the code to fully utilize the explicit datapath. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve up to 200 times speed-up and reduce the total energy consumption by 50% compared to a basic RISC processor.

