Improving Execution Concurrency of Large-Scale Matrix Multiplication on Distributed Data-Parallel Platforms

Gu, Rong; Tang, Yun; Tian, Chen; Zhou, Hucheng; Li, Guanru; Zheng, Xudong; Huang, Yihua

doi:10.1109/tpds.2017.2686384

Cited by 22 publications

(17 citation statements)

References 20 publications


“…On the other hand, with the wide adoption of MapReduce or BSP-style data analytics in the cloud, a number of systems have implemented linear algebra libraries [10,22,26,29,37]. However, BSP programming models are ill-suited for expressing the fine-grained dependencies in linear algebra algorithms, and imposing global synchronous barriers often greatly slows down a job.…”

Section: Related Work (mentioning)

confidence: 99%

“…However, BSP programming models are ill-suited for expressing the fine-grained dependencies in linear algebra algorithms, and imposing global synchronous barriers often greatly slows down a job. As a result, none of these systems [10,22,26,29] have an implementation of distributed Cholesky decomposition that can compare with NumPyWren or ScaLAPACK.…”

Section: Related Work (mentioning)

confidence: 99%


Serverless linear algebra

Shankar

Krauth

Vodrahalli

et al. 2020

Proceedings of the 11th ACM Symposium on Cloud Computing


Datacenter disaggregation provides numerous benefits to both the datacenter operator and the application designer. However switching from the server-centric model to a disaggregated model requires developing new programming abstractions that can achieve high performance while benefiting from the greater elasticity. To explore the limits of datacenter disaggregation, we study an application area that near-maximally benefits from current server-centric datacenters: dense linear algebra. We build NumPyWren, a system for linear algebra built on a disaggregated serverless programming model, and LAmbdaPACK, a companion domain-specific language designed for serverless execution of highly parallel linear algebra algorithms. We show that, for a number of linear algebra algorithms such as matrix multiply, singular value decomposition, Cholesky decomposition, and QR decomposition, NumPyWren's performance (completion time) is within a factor of 2 of optimized server-centric MPI implementations, and has up to 15% greater compute efficiency (total CPU-hours), while providing fault tolerance.


“…When compared to other multi-node methods, this approach performs equally well if not better. This is because in most MPI-based and distributed-computing approaches, reading data from files is mainly done by a single process [15]. Although the overall operation time could be reduced by overlapping communication and computation using clever heuristics [40], [41], it is still an extra exercise.…”

Section: G Comparison With Other Approaches (mentioning)

confidence: 99%

“…The approach adopted on distributed platforms, is generally based on a master-slave process model where blocks of data are broadcast by a node to other nodes in the cluster [15]. Another popular approach is to use tiling followed by batching, where tiling refers to the partitioning of the matrices into tiny blocks or tiles, while batching refers to the assignment of these tiles to threads or computing elements for computation.…”

Section: Related Work (mentioning)

confidence: 99%
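
The "tiling followed by batching" idea in the statement above can be illustrated with a minimal Python sketch. It is not taken from the cited paper or from the indexed article: the function names (`tile_ranges`, `multiply_tile`, `tiled_matmul`), the tile size, and the use of a thread pool as the "computing elements" are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_ranges(n, tile):
    """Split range(n) into consecutive (start, stop) chunks of at most `tile`."""
    return [(start, min(start + tile, n)) for start in range(0, n, tile)]


def multiply_tile(a, b, rows, cols):
    """Compute one output tile of A @ B for the given row and column ranges."""
    (r0, r1), (c0, c1) = rows, cols
    inner = len(b)
    block = [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(c0, c1)]
        for i in range(r0, r1)
    ]
    return r0, c0, block


def tiled_matmul(a, b, tile=2, workers=4):
    """Tiling: partition the output into tiles. Batching: hand tiles to workers."""
    n, m = len(a), len(b[0])
    c = [[0] * m for _ in range(n)]
    # Tiling step: one task per output tile (hypothetical granularity).
    tasks = [(rows, cols)
             for rows in tile_ranges(n, tile)
             for cols in tile_ranges(m, tile)]
    # Batching step: the pool assigns tiles to its worker threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for r0, c0, block in pool.map(lambda t: multiply_tile(a, b, *t), tasks):
            for offset, row in enumerate(block):
                c[r0 + offset][c0:c0 + len(row)] = row
    return c


if __name__ == "__main__":
    print(tiled_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In pure Python the thread pool gives no real speedup because of the interpreter lock; the sketch only shows how independent output tiles can be scheduled separately.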

“…Although this operation could naively be implemented using a simple three-nested loop with algorithmic complexity in O(n^3), it has been the subject of a lot of research over the last few decades. Indeed, areas related to the optimization of this operation and its parallelization using improvised algorithms and software paradigms and hardware architectures are still active areas of research [12], [9], [15], [3], [16], [17], [18], [19], [20]. More recently, with the advent of easily available yet powerful workstations equipped with sophisticated co-processors, researchers have parallelized GEMM on these single-node workstations using accelerators like Graphics Processing Units (GPUs), Many Integrated Cores (MICs) and Field Programmable Gate Arrays (FPGAs) [19], [21], [12], [17].…”

Section: Introduction (mentioning)

confidence: 99%
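
For reference, the naive three-nested-loop multiplication mentioned in the statement above might look like the following Python sketch. The function name `naive_matmul` and the list-of-lists matrix representation are illustrative assumptions, not code from the cited work.

```python
def naive_matmul(a, b):
    """Multiply matrices given as lists of rows using three nested loops."""
    n, inner, m = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    c = [[0] * m for _ in range(n)]
    for i in range(n):              # rows of a
        for j in range(m):          # columns of b
            for k in range(inner):  # shared inner dimension
                c[i][j] += a[i][k] * b[k][j]
    return c


if __name__ == "__main__":
    print(naive_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

For square n-by-n inputs the innermost statement runs n * n * n times, which is where the cubic cost comes from.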


Extending Shared-Memory Computations to Multiple Distributed Nodes

Ahmed

2020

IJACSA


With the emergence of accelerators like GPUs, MICs and FPGAs, the availability of domain-specific libraries (like MKL) and the ease of parallelization associated with CUDA and OpenMP based shared-memory programming, node-based parallelization has recently become a popular choice among developers in the field of scientific computing. This is evident from the large volume of recently published work in various domains of scientific computing, where shared-memory programming and accelerators have been used to accelerate applications. Although these approaches are suitable for small problem sizes, there are issues that need to be addressed for them to be applicable to larger input domains. Firstly, the primary focus of these works has been to accelerate the core kernel; acceleration of input/output operations is seldom considered. Many operations in scientific computing operate on large matrices, both sparse and dense, that are read from and written to external files. These input-output operations present themselves as bottlenecks and significantly affect the overall application time. Secondly, node-based parallelization prevents a developer from distributing the computation beyond a single node without learning an additional programming paradigm like MPI. Thirdly, the problem size that can be effectively handled by a node is limited by the memory of the node and accelerator. In this paper, an Asynchronous Multi-node Execution (AMNE) approach is presented that uses a unique combination of the shared-file system and pseudo-replication to extend node-based algorithms to a distributed multiple node implementation with minimal changes to the original node-based code. We demonstrate this approach by applying it to GEMM, a popular kernel in dense linear algebra, and show that the presented methodology significantly advances the state of the art in the field of parallelization and scientific computing.


AISFIP: Artificial Intelligence Smart Financial Information Platform with Concurrency Computation

Yang

Yang

2022

2022 International Conference on Electronics and Renewable Systems (ICEARS)

