“…Similar mechanisms for translating an irregular pattern of array accesses into inter-node messages have been proposed in order to make it possible to efficiently distribute loops where some array references are made through a level of indirection. Work on this topic was presented by the present authors in [12,8,9] as well as by Mehrotra and Van Rosendale [7,6].…”
We consider optimizations that are required for efficient execution of code segments that consist of loops over distributed data structures. The PARTI execution-time primitives are designed to perform these optimizations and can be used to implement a wide range of scientific algorithms on distributed memory machines. These primitives allow the user to control array mappings in a way that gives an appearance of shared memory. Computations can be based on a global index set. Primitives are used to perform gather and scatter operations on distributed arrays. Communication patterns are derived at run time, and the appropriate send and receive messages are automatically generated.
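The run-time derivation of communication patterns described above is commonly realized as an inspector/executor scheme: an inspector pass examines the irregular index set to build a communication schedule, and an executor pass then performs the gathers. The sketch below is illustrative only — the function names, block distribution, and dictionary-based "messages" are assumptions for clarity, not the actual PARTI API.

```python
# Minimal inspector/executor sketch (hypothetical, not the PARTI API).
# A global array is block-distributed: processor p owns elements
# [p*BLOCK, (p+1)*BLOCK). The inspector derives, at run time, which
# owner holds each referenced global index; the executor then fetches
# the values (standing in for the generated send/receive messages).

BLOCK = 4  # elements per processor in the assumed block distribution

def inspector(global_indices, block=BLOCK):
    """Build a communication schedule: owner -> local offsets needed."""
    schedule = {}
    for i in global_indices:
        owner, offset = divmod(i, block)
        schedule.setdefault(owner, []).append(offset)
    return schedule

def executor_gather(local_parts, schedule, global_indices, block=BLOCK):
    """Execute the schedule and return values in reference order."""
    fetched = {}
    for owner, offsets in schedule.items():
        for off in offsets:
            fetched[owner * block + off] = local_parts[owner][off]
    return [fetched[i] for i in global_indices]

# A distributed array of 8 elements over 2 "processors".
parts = [[10, 11, 12, 13], [14, 15, 16, 17]]
idx = [6, 1, 3, 5]  # irregular references through a level of indirection
sched = inspector(idx)
print(executor_gather(parts, sched, idx))  # -> [16, 11, 13, 15]
```

The key point the sketch illustrates is that the schedule is computed once from the indirection array and can be reused across loop iterations, amortizing the inspection cost.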
“…The data streams can be reoriented with respect to the processor coordinates to reduce data dependence distances. The input data streams, Xin and W, can migrate in the computation lattice along the processor coordinate system to minimize communication [30].…”
Section: Synthesis Constraints Due to Implementation
Abstract: Portable image processing applications require an efficient, scalable platform with localized computing regions. This paper presents a new class of area I/O systolic architecture to exploit the physical data locality of planar data streams by processing data where it falls. A synthesis technique using dependence graphs, data partitioning, and computation mapping is developed to handle planar data streams and to systematically design arrays with area I/O. Simulation results show that the use of area I/O provides a 16 times speedup over systems with perimeter I/O. Performance comparisons for a set of signal processing algorithms show that systolic arrays that consider planar data streams in the design process are up to three times faster than traditional arrays.
Index Terms: Parallel computer architecture, systolic arrays, area I/O, design and performance evaluation.