The WaveScalar architecture

Swanson, Steven; Schwerin, Andrew; Mercaldi, Martha; Petersen, Andrew; Putnam, Andrew; Michelson, Ken; Oskin, Mark; Eggers, Susan J.

doi:10.1145/1233307.1233308

Cited by 119 publications

(104 citation statements)

References 56 publications

“…In fact, parallelism breaks the computation into finer-grain chunks on separate regions. This reduces global memory accesses by leveraging local storage in regions and optional local memories, which is analogous to spatial computing approaches in classical computing [14,45,46].…”

Section: The Multi-SIMD Architectural Model (mentioning)

confidence: 99%

Compiler Management of Communication and Parallelism for Quantum Computation

Heckey, Patil, Javadi-Abhari, et al. 2015

Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

Quantum computing (QC) offers huge promise to accelerate a range of computationally intensive benchmarks. Quantum computing is limited, however, by the challenges of decoherence: i.e., a quantum state can only be maintained for short windows of time before it decoheres. While quantum error correction codes can protect against decoherence, fast execution time is the best defense against decoherence, so efficient architectures and effective scheduling algorithms are necessary. This paper proposes the Multi-SIMD QC architecture and then proposes and evaluates effective schedulers to map benchmark descriptions onto Multi-SIMD architectures. The Multi-SIMD model consists of a small number of SIMD regions, each of which may support operations on up to thousands of qubits per cycle. Efficient Multi-SIMD operation requires efficient scheduling. This work develops schedulers to reduce communication requirements of qubits between operating regions, while also improving parallelism. We find that communication to global memory is a dominant cost in QC. We also note that many quantum benchmarks have long serial operation paths (although each operation may be data parallel). To exploit this characteristic, we introduce Longest-Path-First Scheduling (LPFS) which pins operations to SIMD regions to keep data in-place and reduce communication to memory. The use of small, local scratchpad memories also further reduces communication. Our results show a 3% to 308% improvement for LPFS over conventional scheduling algorithms, and an additional 3% to 64% improvement using scratchpad memories. Our work is the most comprehensive software-to-quantum toolflow published to date, with efficient and practical scheduling techniques that reduce communication and increase parallelism for full-scale quantum code executing up to a trillion quantum gate operations.

ASPLOS '15, March 14-18, 2015, Istanbul, Turkey.

“…Tiled architectures, such as Raw [Taylor et al 2004], TRIPS [Sankaralingam et al 2003], and WaveScalar [Swanson et al 2007], are a common approach to improving scalability because they reduce wire delay. Scalable CoDA systems also use a tiled architecture for this reason and to distribute coprocessors among multiple memory and host interfaces.…”

Section: Related Work (mentioning)

confidence: 99%

Exploring Energy Scalability in Coprocessor-Dominated Architectures for Dark Silicon

Zheng, Goulding-Hotta, Ricketts, et al. 2014

ACM Trans. Embed. Comput. Syst.

Self Cite

As chip designers face the prospect of increasingly dark silicon, there is increased interest in incorporating energy-efficient specialized coprocessors into general-purpose designs. For specialization to be a viable means of leveraging dark silicon, it must provide energy savings over the majority of execution for large, diverse workloads, and this will require deploying coprocessors in large numbers. Recent work has shown that automatically-generated application-specific coprocessors can greatly improve energy efficiency, but it is not clear that current techniques will scale to coprocessor-dominated architectures (CoDAs) with hundreds or thousands of coprocessors. We show that scaling CoDAs to include very large numbers of coprocessors is challenging because of the energy cost of interconnects, the memory system, and leakage. These overheads grow with the number of coprocessors and, left unchecked, will squander the energy gains that coprocessors can provide. The paper presents a detailed study of energy costs across a wide range of tiled CoDA designs and shows that careful choice of cache configuration, tile size, coarse-grain power management, and transistor implementation can limit the growth of these overheads. For multi-threaded workloads, designers must also take care to avoid excessive contention for coprocessors, which can significantly increase energy consumption. The results suggest that, for CoDAs that target larger workloads, amortizing shared overheads via multithreading can provide up to 3.8× reductions in energy per instruction, retaining much of the 5.3× potential of smaller designs.

“…The dataflow model [7], is one of the major contenders in meeting the above criteria. Unfortunately this model is relatively inefficient and also has difficulty in capturing the imperative programming style.…”

Section: Introduction (mentioning)

confidence: 99%

“…Unfortunately this model is relatively inefficient and also has difficulty in capturing the imperative programming style. The former is illustrated in [7], which describes a loop summing its own index that requires seven instructions in its body, six instructions of overhead for a single operation. For comparison, the sequential model has an overhead of two instructions to implement the same loop and more to the point, the micro-architecture described here has a zero-instruction overhead.…”

Section: Introduction (mentioning)

confidence: 99%

A general model of concurrency and its implementation as many-core dynamic RISC processors

Bernard, Bousias, Guang, et al. 2008

2008 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation

Bernard, T.A.M.; Bousias, K.; Guang, L.; Jesshope, C.R.; Lankamp, M.; van Tol, M.W.; Zhang, L.

This paper presents a concurrent execution model and its micro-architecture based on in-order RISC processors, which schedules instructions from large pools of contextualised threads. The model admits a strategy for programming chip multiprocessors using parallelising compilers based on existing languages. The model is supported in the ISA by a number of instructions to create and manage abstract concurrency. The paper estimates the cost of supporting these instructions in silicon. The model and its implementation use dynamic parameterisation of concurrency creation, where a single instruction captures asynchronous remote function execution, mutual exclusion and the execution of a general concurrent loop structure and all associated communication. Concurrent loops may be dependent or independent, bounded or unbounded and may be nested arbitrarily. Hierarchical concurrency allows compilers to restructure and parallelise sequential code to meet the strict constraints on the model, which provide its freedom from deadlock and locality of communication. Communication is implicit in both the model and micro-architecture, due to the dynamic distribution of concurrency. The result is location-independent binary code that may execute on any number of processors. Simulation and analysis of the micro-architecture indicate that the model is a strong candidate for the exploitation of many-core processors. The results show near-linear speedup over two orders of magnitude of processor scaling, good energy efficiency and tolerance to large latencies in asynchronous operations. This is true for both independent threads as well as for reductions.
