Architecting Waferscale Processors - A GPU Case Study

Pal, Saptadeep; Petrisko, Daniel; Tomei, Matthew; Gupta, Puneet; Iyer, Subramanian S.; Kumar, Rakesh

doi:10.1109/hpca.2019.00042

Cited by 36 publications

(9 citation statements)

References 55 publications


“…Next, we evaluate the performance improvement that multinode packaged systems (e.g., MCM-GPU [17], waferscale-GPU [18], Tesla Dojo [19]) can provide in a distributed training setup (see Fig. 11).…”

Section: Effect Of Multi-node Package (mentioning)

confidence: 99%

DeepFlow: A Cross-Stack Pathfinding Framework for Distributed AI Systems

Ardalani, Pal, Gupta

2022

Preprint


Over the past decade, machine learning model complexity has grown at an extraordinary rate, as has the scale of the systems training such large models. However, there is an alarmingly low hardware utilization (5-20%) in large-scale AI systems. The low system utilization is a cumulative effect of minor losses across different layers of the stack, exacerbated by the disconnect between engineers designing different layers spanning different industries. We propose CrossFlow, a novel framework that enables cross-layer analysis all the way from the technology layer to the algorithmic layer. We also propose DeepFlow (built on top of CrossFlow using machine learning techniques) to automate the design space exploration and co-optimization across different layers of the stack. We have validated CrossFlow accuracy with distributed training on real commercial hardware and showcase several DeepFlow case studies demonstrating pitfalls of not optimizing across the technology-hardware-software stack for what is likely the most important workload driving large development investments in all aspects of the computing stack.


“…Data is physically moved around, and therefore scheduling the communication operations is of critical importance to minimize qubit movements and consolidate interactions with a given qubit in the minimum amount of time. NoCs are generally not bound to scheduling, although efforts in real-time embedded systems or machine learning accelerators also advocate for it in the classical domain [15,22]. In any case, this aspect is at the frontier between the network and the architecture. Welcome back to circuit switching: Quantum teleportation uses both a classical channel and a quantum channel to transmit the information: the measurement output at the Tx node (2 bits), and the entangled photon qubit pair.…”

Section: Comparison With Network-on-chip (mentioning)

confidence: 99%

Modelling Short-range Quantum Teleportation for Scalable Multi-Core Quantum Computing Architectures

Rodrigo, Abadal, Almudéver, et al. 2021

Proceedings of the Eight Annual ACM International Conference on Nanoscale Computing and Communication


Multi-core quantum computing has been identified as a solution to the scalability problem of quantum computing. However, interconnecting quantum chips is not trivial, as quantum communications have their share of quantum weirdness: quantum decoherence and the no-cloning theorem make transferring qubits a harsh challenge, where every extra nanosecond counts and retransmission is simply impossible. In this paper, we present our first steps towards thorough modeling of quantum communications for multi-core quantum computers, which may be considered as a middle point between the well-known paradigms of Quantum Internet and Network-on-Chip. In particular, we stress the deep entanglement that exists between latency and error rates in quantum computing, and how this affects the quantum network design for this scenario. Moreover, we show the concomitant trade-off between computation and communication resources for a set of parameters out of state-of-the-art experimental research. The observed behavior lets us foresee the potential of multi-core quantum architectures.

CCS Concepts: Computer systems organization → Quantum computing; Distributed architectures; Networks → Network on chip.


“…Graph Partitioning Based on Resource: In some of the emerging architectures, graph partitioning algorithms can be used to assign tasks to the best execution unit available. In [44], a wafer-scale architecture is proposed to minimize communication overheads and memory access latency. It aims to schedule TBs with high data sharing to adjacent processing modules.…”

Section: Related Work (mentioning)

confidence: 99%

Paver

Tripathy, Abdolrashidi, Bhuyan, et al. 2021

ACM Trans. Archit. Code Optim.


The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache sizes per thread, leading to serious cache contention problems such as thrashing. Hence, the data access locality of an application should be considered during thread scheduling to improve execution time and energy consumption. Recent works have tried to use the locality behavior of regular and structured applications in thread scheduling, but the difficult case of irregular and unstructured parallel applications remains to be explored. We present PAVER, a Priority-Aware Vertex schedulER, which takes a graph-theoretic approach toward thread scheduling. We analyze the cache locality behavior among thread blocks (TBs) through a just-in-time compilation, and represent the problem using a graph representing the TBs and the locality among them. This graph is then partitioned to TB groups that display maximum data sharing, which are then assigned to the same streaming multiprocessor by the locality-aware TB scheduler. Through exhaustive simulation in Fermi, Pascal, and Volta architectures using a number of scheduling techniques, we show that PAVER reduces L2 accesses by 43.3%, 48.5%, and 40.21% and increases the average performance benefit by 29%, 49.1%, and 41.2% for the benchmarks with high inter-TB locality.
