Ahmed E. Helal scite author profile

Abstract: To attain scalable performance efficiently, the HPC community expects future exascale systems to consist of multiple nodes, each with different types of hardware accelerators. In addition to GPUs and Intel MICs, candidate accelerators include embedded multiprocessors and FPGAs. End users need appropriate tools to efficiently use the available compute resources in such systems, both within a compute node and across compute nodes. As such, we present MetaMorph, a library framework designed to (automatically) extract as much computational capability as possible from HPC systems. Its design centers around three core principles: abstraction, interoperability, and adaptivity. To demonstrate its efficacy, we present a case study that uses the structured grids design pattern, which is heavily used in computational fluid dynamics. We show how MetaMorph significantly reduces the development time, while delivering performance and interoperability across an array of heterogeneous devices, including multicore CPUs, Intel MICs, AMD GPUs, and NVIDIA GPUs.


Parallel circuit simulation using the direct method on a heterogeneous cloud

Helal

Bayoumi

Hanafy

2015


This paper discusses the development of a parallel SPICE circuit simulator using the direct method on a cloud-based heterogeneous cluster, which includes multiple HPC compute nodes with multiple sockets, multiple cores, and GPUs. A simple model is derived to optimally partition the circuit between the compute nodes. The parallel simulator is divided into four major kernels: Partition Device Model Evaluation (PME), Partition Matrix Factorization (PMF), Interconnection Matrix Evaluation (IME), and Interconnection Matrix Factorization (IMF). Another model is derived to assign each of the kernels to the most suitable execution platform of the Amazon EC2 heterogeneous cloud. The partitioning approach using heterogeneous resources has achieved an order-of-magnitude speedup over optimized multithreaded implementations of SPICE using the state-of-the-art KLU and NICSLU packages for matrix solution.


Exploring FPGA-specific Optimizations for Irregular OpenCL Applications

Hassan

Helal

Athanas

et al. 2018


Alto

Helal

Laukemann

Checconi

et al. 2021


Efficient, out-of-memory sparse MTTKRP on massively parallel architectures

Nguyen

Helal

Checconi

et al. 2022


Tensor decomposition (TD) is an important method for extracting latent information from high-dimensional (multi-modal) sparse data. This study presents a novel framework for accelerating fundamental TD operations on massively parallel GPU architectures. In contrast to prior work, the proposed Blocked Linearized CoOrdinate (BLCO) format enables efficient out-of-memory computation of tensor algorithms using a unified implementation that works on a single tensor copy. Our adaptive blocking and linearization strategies not only meet the resource constraints of GPU devices, but also accelerate data indexing, eliminate control-flow and memory-access irregularities, and reduce kernel launching overhead. To address the substantial synchronization cost on GPUs, we introduce an opportunistic conflict resolution algorithm, in which threads collaborate instead of contending on memory access to discover and resolve their conflicting updates on-the-fly, without keeping any auxiliary information or storing non-zero elements in specific mode orientations. As a result, our framework delivers superior in-memory performance compared to prior state-of-the-art, and is the only framework capable of processing out-of-memory tensors. On the latest Intel and NVIDIA GPUs, BLCO achieves 2.12-2.6× geometric-mean speedup (with up to 33.35× speedup) over the state-of-the-art mixed-mode compressed sparse fiber (MM-CSF) on a range of real-world sparse tensors.

CCS Concepts: Mathematics of computing → Mathematical software performance; Computing methodologies → Massively parallel algorithms.
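To make the idea of a linearized coordinate format concrete, the following is a minimal CPU-only sketch, not the BLCO implementation described in the abstract. It assumes a simple bit-packing scheme in which each mode's index occupies its own bit field of a single integer key, and it computes MTTKRP sequentially over those keys. The function names (`bits_needed`, `linearize`, `delinearize`, `mttkrp`) and the data layout are hypothetical; blocking, GPU execution, and conflict resolution are deliberately left out.

```python
def bits_needed(dim):
    """Bits required to store an index in the range [0, dim)."""
    return max(1, (dim - 1).bit_length())


def linearize(coords, dims):
    """Pack a multi-dimensional index into one integer key.

    Assumed layout (illustrative only): mode 0 sits in the lowest bits,
    followed by mode 1, and so on.
    """
    key, shift = 0, 0
    for c, d in zip(coords, dims):
        key |= c << shift
        shift += bits_needed(d)
    return key


def delinearize(key, dims):
    """Recover the per-mode indices from a key produced by linearize()."""
    coords = []
    for d in dims:
        b = bits_needed(d)
        coords.append(key & ((1 << b) - 1))
        key >>= b
    return tuple(coords)


def mttkrp(nonzeros, dims, factors, mode, rank):
    """Sequential MTTKRP over (key, value) pairs.

    factors[m] is a dims[m] x rank matrix stored as a list of rows.
    The factor matrix of the target mode is not read.
    """
    out = [[0.0] * rank for _ in range(dims[mode])]
    for key, val in nonzeros:
        coords = delinearize(key, dims)
        row = out[coords[mode]]
        for r in range(rank):
            prod = val
            for m, c in enumerate(coords):
                if m != mode:
                    prod *= factors[m][c][r]
            # On a GPU, this update is where concurrent threads could conflict.
            row[r] += prod
    return out


# Usage: a 2 x 2 x 2 tensor with three non-zeros and rank-1 factors.
dims = (2, 2, 2)
entries = [((0, 1, 1), 1.0), ((1, 0, 1), 2.0), ((0, 0, 0), 1.0)]
nonzeros = [(linearize(c, dims), v) for c, v in entries]
factors = [[[1.0], [1.0]], [[2.0], [3.0]], [[5.0], [7.0]]]
result = mttkrp(nonzeros, dims, factors, mode=0, rank=1)
```

Because every non-zero carries all of its indices in one key, the same list of pairs can be reused for any target mode by changing the `mode` argument, which loosely mirrors the abstract's point about working on a single tensor copy.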


A Composable Workflow for Productive Heterogeneous Computing on FPGAs via Whole-Program Analysis and Transformation

Sathre

Helal

Feng

2018



Adaptive Task Aggregation for High-Performance Sparse Solvers on GPUs

AutoMatch: An automated framework for relative performance estimation and workload distribution on heterogeneous HPC systems

MetaMorph: A Library Framework for Interoperable Kernels on Multi- and Many-Core Clusters