Architecture of the VPP500 parallel supercomputer

Utsumi, Teruo; Ikeda, Masayuki; Takamura, Moriyuki

doi:10.1145/602770.602852

Cited by 4 publications

(5 citation statements)

References 0 publications


“…Many vector processors, such as the Cray-1 [11] and the VPP500 [15], take a similar approach. On almost all vector processors, the only way to compress or expand a vector is through gather/scatter instructions that cycle data through memory.…”
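To make the statement above concrete, here is a minimal Python sketch of compressing and expanding a vector by way of gather/scatter. It models memory as a plain list; the function names `compress_via_gather` and `expand_via_scatter` are illustrative assumptions and do not correspond to actual Cray-1 or VPP500 instructions.

```python
def compress_via_gather(memory, base, mask):
    """Pack the elements of memory[base:base+len(mask)] whose mask bit is set."""
    # Step 1: turn the mask into an index vector of the selected addresses.
    indices = [base + i for i, bit in enumerate(mask) if bit]
    # Step 2: gather, modeled as one indexed load per selected element.
    return [memory[i] for i in indices]


def expand_via_scatter(memory, base, mask, packed):
    """Write packed values back to the positions whose mask bit is set."""
    indices = [base + i for i, bit in enumerate(mask) if bit]
    if len(indices) != len(packed):
        raise ValueError("packed length must equal the number of set mask bits")
    # Scatter, modeled as one indexed store per packed element.
    for i, value in zip(indices, packed):
        memory[i] = value
    return memory
```

Every selected element makes a round trip through memory in this model, which is the cost the citing paper points to.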

Section: Related Work (mentioning)

confidence: 99%

“…Data-parallel architectures, such as vector [2] [11] [15] [16], SIMD [3] [12], and stream [9] processors, are well suited to extracting this data parallelism, achieving very high levels of performance. They utilize partitioned register files and reduced control overhead in order to support 10s to 100s of ALUs efficiently on a single chip [10].…”

Section: Introduction (mentioning)

confidence: 99%


Efficient conditional operations for data-parallel architectures

Kapasi, Dally, Rixner, et al. 2000

Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture


Many data-parallel applications, including emerging media applications, have regular structures that can easily be expressed as a series of arithmetic kernels operating on data streams. Data-parallel architectures are designed to exploit this regularity by performing the same operation on many data elements concurrently. However, applications containing data-dependent control constructs perform poorly on these architectures. Conditional streams convert these constructs into data-dependent data movement. This allows data-parallel architectures to efficiently execute applications with data-dependent control flow. Essentially, conditional streams extend the range of applications that a data-parallel architecture can execute efficiently. For example, polygon rendering speeds up by a factor of 1.8 with the use of conditional streams.
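The idea of turning data-dependent control into data-dependent data movement can be illustrated with a small Python sketch. This is one plausible reading of the abstract, not the mechanism as implemented in the paper: the helper `split_by_condition` and the two toy kernels are hypothetical, chosen only to show an if/else being replaced by routing elements into separate streams.

```python
def split_by_condition(stream, predicate):
    """Route each element to one of two output streams based on a condition."""
    taken, not_taken = [], []
    for x in stream:
        (taken if predicate(x) else not_taken).append(x)
    return taken, not_taken


def process(stream):
    """Replace a per-element if/else with two branch-free kernels."""
    negatives, non_negatives = split_by_condition(stream, lambda x: x < 0)
    # Each kernel now applies the same operation to every element it receives.
    negated = [-x for x in negatives]
    doubled = [2 * x for x in non_negatives]
    return negated, doubled
```

For example, `process([3, -1, 4, -5])` returns `([1, 5], [6, 8])`. Note that the outputs are grouped by branch rather than kept in input order, so a consumer that needs the original order would have to carry element indices along.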


“…The cost for execution of the optimised version of our fragment with K segments is […] and, assuming that the communication time is greater than the computation time, it becomes […]. We can obtain the value of c directly from Tcomp(N). To determine a and b, however, we repeated the non-optimized execution with a different data set size (N'=2^14), obtaining Tcommun(N')=187.6 msec on the iPSC/860 and Tcommun(N')=10.9 msec on the Paragon.…”

Section: Problems With Current Systems (mentioning)

confidence: 99%

“…Indeed, vector processing capability exists on some current parallel systems. On the CM-5 [13], there are separate scalar and vector processors on each node, whereas the Fujitsu VPP500 [14] uses a traditional vector processor with scalar and vector functional units. To our knowledge, however, none of the existing systems provides communication support as a native processor feature.…”

Section: Related Work (mentioning)

confidence: 99%

Integrating Message-Passing with Vector Architectures

Mendes 1995

Anais Do VII Simpósio De Arquitetura De Computadores E Processamento De Alto Desempenho (SBAC-PAD 1995)


Vector architectures provide excellent computational throughput, while successfully tolerating memory latency by pipelining memory accesses. In this paper, we propose a generalization of vector architectures to message-passing multicomputers, which combines the efficiency of vector computation with the scalability of distributed-memory systems. In our proposed architecture, each node is a conventional vector processor (with chaining capability and pipelined functional units) augmented by native instructions to send and receive messages through vector registers. In this scheme, inter-node communication can be performed via vector-send/receive instructions, gaining the benefits of communication pipelining, reduced memory copies (memory-to-register-to-register instead of memory-to-memory-to-cache), and lower communication latency (due to tight processor-communication coupling). We show that this strong integration between functional and communication units can lead to substantial performance improvement over conventional message-passing multicomputers. We model pipelined computation-communication systems both analytically and with a detailed construction-level simulation, and compare this simulation data with empirical results from an Intel Paragon. Preliminary data from a matrix multiplication example indicates our proposed vector-parallel architecture offers significant scalability benefits over existing message-passing systems.

“…In turn, the availability of precise exceptions allows the introduction of virtual memory. Virtual memory has been implemented in vector machines [15], but is not used in many current high performance parallel vector processors [7]. Or, it is used in a very restricted form, for example by locking pages containing vector data in memory while a vector program executes [7, 14]. The primary problem with implementing precise page faults in a high performance vector machine is the high number of overlapped "in-flight" operations: in some machines there may be several hundred.…”

Section: Implementing Precise Traps (mentioning)

confidence: 99%

Out-of-order vector architectures

Espasa, Valero, Smith

Proceedings of 30th Annual International Symposium on Microarchitecture


Register renaming and out-of-order instruction issue are now commonly used in superscalar processors. These techniques can also be used to significant advantage in vector processors, as this paper shows. Performance is improved and available memory bandwidth is used more effectively. Using a trace-driven simulation we compare a conventional vector implementation, based on the Convex C3400, with an out-of-order, register renaming, vector implementation. When the number of physical registers is above 12, out-of-order execution coupled with register renaming provides a speedup of 1.24-1.72 for realistic memory latencies. Out-of-order techniques also tolerate main memory latencies of 100 cycles with a performance degradation less than 6%. The mechanisms used for register renaming and out-of-order issue can be used to support precise interrupts, generally a difficult problem in vector machines. When precise interrupts are implemented, there is typically less than a 10% degradation in performance. A new technique based on register renaming is targeted at dynamically eliminating spill code; this technique is shown to provide an extra speedup ranging between 1.10 and 1.20 while reducing total memory traffic by an average of 15-20%.
