Memory Optimized Dynamic Matrix Chain Multiplication Using Shared Memory in GPU

Biswas, Gargi; Mukherjee, Nandini

doi:10.1007/978-3-030-65621-8_10

Cited by 3 publications

(1 citation statement)

References 4 publications

“…More recently, Diwan and Tembhurne [13] designed an adaptive generalized mapping method to parallelize non-serial polyadic dynamic-programming problems that utilize GPUs, for efficient mapping of subproblems onto processing threads in each phase. Biswas and Mukherjee [14] proposed a new memory optimized technique and a versatile technique of utilizing shared memory in blocks of threads to minimize time for accessing dimensions of matrices on GPU architectures. On shared-memory architectures, Mabrouk [10] designed solutions based on loop transformations.…”

Section: Introduction (mentioning)

confidence: 99%

Coarse-grained multicomputer parallel algorithm using the four-splitting technique for the minimum cost parenthesizing problem

Lacmou Zeutouo, Kengne Tchendji, Myoupo

2023

Revue Africaine De Recherche en Informatique Et Mathématiques Appliquées

Dynamic programming is a technique widely used to solve combinatorial optimization problems. A well-known example is the minimum cost parenthesizing problem (MPP), which is often used to represent the class of non-serial polyadic dynamic-programming problems. These problems are characterized by a strong dependency between subproblems. This paper outlines a coarse-grained multicomputer parallel solution to the MPP based on the four-splitting technique. This partitioning technique subdivides the dependency graph into subgraphs (or blocks) of variable size and splits large blocks into four subblocks, which avoids the communication overhead caused by a similar partitioning technique in the literature. The solution evaluates a block by computing and communicating each of its subblocks in turn, which reduces the latency time of processors, the component that accounts for most of the global communication time. It requires O(n^3/p) execution time with O(k * \sqrt{p}) communication rounds, where n is the input data size, p is the number of processors, and k is the number of times the size of blocks is subdivided.
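For readers unfamiliar with the MPP, the subproblem dependency the abstract refers to can be seen in the textbook sequential dynamic program for matrix chain ordering. The sketch below is that standard O(n^3) formulation, not the coarse-grained multicomputer algorithm of the cited paper and not the GPU method of Biswas and Mukherjee; the function names `matrix_chain_order` and `parenthesize` and the `dims` convention (matrix i has shape dims[i] x dims[i+1]) are assumptions made for illustration.

```python
def matrix_chain_order(dims):
    """Textbook O(n^3) DP for the minimum cost parenthesizing problem.

    Assumption: matrix i (0-based) has shape dims[i] x dims[i + 1].
    Returns (minimum scalar multiplications, split table).
    """
    n = len(dims) - 1  # number of matrices in the chain
    if n < 1:
        raise ValueError("dims must describe at least one matrix")

    cost = [[0] * n for _ in range(n)]   # cost[i][j]: best cost for A_i..A_j
    split = [[0] * n for _ in range(n)]  # split[i][j]: best split index k

    # Fill the table diagonal by diagonal: a chain of a given length
    # depends on every shorter chain it contains. This is the strong
    # dependency between subproblems that parallel solutions must handle.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):
                candidate = (cost[i][k] + cost[k + 1][j]
                             + dims[i] * dims[k + 1] * dims[j + 1])
                if candidate < cost[i][j]:
                    cost[i][j] = candidate
                    split[i][j] = k
    return cost[0][n - 1], split


def parenthesize(split, i, j):
    """Rebuild the optimal parenthesization as a string (1-based labels)."""
    if i == j:
        return f"A{i + 1}"
    k = split[i][j]
    return f"({parenthesize(split, i, k)} x {parenthesize(split, k + 1, j)})"


if __name__ == "__main__":
    best, table = matrix_chain_order([10, 30, 5, 60])
    print(best, parenthesize(table, 0, 2))
```

Entries on the same diagonal of `cost` do not depend on one another, so they can in principle be evaluated concurrently; how those entries are grouped into blocks and assigned to processors or threads is where the parallel approaches described above differ.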
