Representation-transparent matrix algorithms with scalable performance

Gottschling, Peter; Wise, David S.; Adams, Michael D.

doi:10.1145/1274971.1274989

Cited by 32 publications

(18 citation statements)

References 19 publications

(22 reference statements)


“…First, methods for program transformation have been explored to improve the cache hit rate in uniprocessor systems, or to improve the data locality in distributed-memory parallel computers [9], [10]. Gottschling [11] proposed a representation-transparent matrix algorithm for multicore chip and developed matrix template library (MTL) for matrix application, such as matrix multiplication. Ruetsch [12] and Podlozhnyuk [13] combine features of the GPU warp access memory and the shared memory structure to enhance matrix transpose performance.…”

Section: Related Work (mentioning)

confidence: 99%

Window Memory Layout Scheme for Alternate Row-Wise/Column-Wise Matrix Access

Guo

Tang

Dou

et al. 2013

IEICE Trans. Inf. & Syst.

SUMMARY: The effective bandwidth of the dynamic random-access memory (DRAM) for the alternate row-wise/column-wise matrix access (AR/CMA) mode, which is a basic characteristic in scientific and engineering applications, is very low. Therefore, we propose the window memory layout scheme (WMLS), which is a matrix layout scheme that does not require transposition, for AR/CMA applications. This scheme maps one row of a logical matrix into a rectangular memory window of the DRAM to balance the bandwidth of the row- and column-wise matrix access and to increase the DRAM IO bandwidth. The optimal window configuration is theoretically analyzed to minimize the total number of no-data-visit operations of the DRAM. Different WMLS implementations are presented according to the memory structure of field-programmable gate array (FPGA), CPU, and GPU platforms. Experimental results show that the proposed WMLS can significantly improve DRAM bandwidth for AR/CMA applications. Speedup factors of 1.6× and 2.0× are achieved for the general-purpose CPU and GPU platforms, respectively. For the FPGA platform, the WMLS DRAM controller is custom. The maximum bandwidth for the AR/CMA mode reaches 5.94 GB/s, which is a 73.6% improvement compared with that of the traditional row-wise access mode. Finally, we apply the WMLS scheme to a Chirp Scaling SAR application; compared with the traditional access approach, maximum speedup factors of 4.73×, 1.33×, and 1.56× are achieved for the FPGA, CPU, and GPU platforms, respectively. Key words: window memory layout scheme (WMLS), alternate row-wise/column-wise matrix access, SDRAM, GPU, FPGA


“…Sequential libraries such as STL [19], BGL [10], and MTL [9], provide data structures such as arrays, vectors, lists, maps, matrices, and graphs. A parallel container is an object oriented implementation of a data structure designed to be used efficiently in a parallel environment.…”

Section: The STAPL Parallel Container (mentioning)

confidence: 99%

“…pContainers can be constructed from any base container, sequential or parallel, so long as it can support the required interface. The pContainers currently provided in STAPL use the corresponding STL containers (e.g., the STAPL pVector uses the STL vector), containers from other sequential libraries (e.g., MTL [9] for matrices), containers available in libraries developed for multicore (e.g., TBB [14] concurrent containers), or other pContainers. This flexibility allows for code reuse and supports interoperability with other libraries.…”

Section: pContainer Definition (mentioning)

confidence: 99%

“…Thus, the PCF makes developing a pContainer almost as easy as developing its sequential counterpart. Moreover, the PCF facilitates interoperability by enabling the use of parallel or sequential containers from other libraries, e.g., MTL [9], BGL [10] or TBB [14].…”

Section: The Parallel Container Framework (PCF) (mentioning)

confidence: 99%

“…The STAPL runtime system includes a communication library (ARMI) and an executor that executes pRanges. Sequential libraries such as STL [19], BGL [10], and MTL [9], provide the user with a collection of data structures that simplifies the application development process. Similarly, STAPL provides the Parallel Container Framework (PCF) which includes a set of elementary pContainers and tools to facilitate the customization and specialization of existing pContainers and the development of new ones.…”

Section: Introduction (mentioning)

confidence: 99%

The STAPL parallel container framework

et al. 2011

The Standard Template Adaptive Parallel Library (STAPL) is a parallel programming infrastructure that extends C++ with support for parallelism. It includes a collection of distributed data structures called pContainers that are thread-safe, concurrent objects, i.e., shared objects that provide parallel methods that can be invoked concurrently. In this work, we present the STAPL Parallel Container Framework (PCF), which is designed to facilitate the development of generic parallel containers. We introduce a set of concepts and a methodology for assembling a pContainer from existing sequential or parallel containers, without requiring the programmer to deal with concurrency or data distribution issues. The PCF provides a large number of basic parallel data structures (e.g., pArray, pList, pVector, pMatrix, pGraph, pMap, pSet). It also provides a class hierarchy and a composition mechanism that allow users to extend and customize the current container base for improved application expressivity and performance. We evaluate STAPL pContainer performance on a CRAY XT4 massively parallel system and show that pContainer methods, generic pAlgorithms, and different applications provide good scalability on more than 16,000 processors.

Universal: Reliable, Reproducible, and Energy-Efficient Numerics

Omtzigt

Quinlan

2022

Next Generation Arithmetic
