2018 IEEE International Conference on Computational Science and Engineering (CSE)
DOI: 10.1109/cse.2018.00026
HSTREAM: A Directive-Based Language Extension for Heterogeneous Stream Computing

Abstract: Big data streaming applications require utilization of heterogeneous parallel computing systems, which may comprise multiple multi-core CPUs and many-core accelerating devices such as NVIDIA GPUs and Intel Xeon Phis. Programming such systems requires advanced knowledge of several hardware architectures and device-specific programming models, including OpenMP and CUDA. In this paper, we present HSTREAM, a compiler directive-based language extension to support programming stream computing applications for heterog…

Cited by 10 publications (5 citation statements) · References 18 publications
“…In general, as long as the capacity of the cache/shared memory is not exceeded, the larger the tile size, the better the data locality. As shown in Figure 3, the sizes (16,32) clause of the tile directive indicates that the matrices A, B, and C are split into (M∕16 + 1) × (N∕32 + 1) tiles respectively and the two nested for-loops are transformed into four nested for-loops.…”
Section: Loop Optimizationmentioning
confidence: 99%
See 1 more Smart Citation
“…In general, as long as the capacity of the cache/shared memory is not exceeded, the larger the tile size, the better the data locality. As shown in Figure 3, the sizes (16,32) clause of the tile directive indicates that the matrices A, B, and C are split into (M∕16 + 1) × (N∕32 + 1) tiles respectively and the two nested for-loops are transformed into four nested for-loops.…”
Section: Loop Optimizationmentioning
confidence: 99%
“…OmpSs 31 is a task‐based parallel programming model composed of a set of directives and library routines, which enables the effective parallelization of applications across multiple heterogeneous devices (such as GPUs and FPGAs). HSTREAM 32 is a high‐level parallel programming model based on OpenMP‐like compiler directives, which enables programmers to easily develop stream computing applications that can be cooperatively performed on both multi‐core CPUs and accelerators. AIRA 33 is a programming framework that supports the flexible execution of compute kernels written using standard OpenMP directives and clauses on heterogeneous CPU‐GPU platforms.…”
Section: Related Workmentioning
confidence: 99%
“…Other works look at performance optimization for numerical solvers [38], sparse matrix-vector multiplication [39], [40], and dynamic stochastic economic models [39]. Ferrão et al [41] and Memeti et al [42] develop stream processing frameworks for the Xeon Phi to increase programming productivity. The runtime can automatically distribute workloads across CPUs and accelerating devices.…”
Section: Domain-specific Optimizationsmentioning
confidence: 99%
“…As presented in [68], many powerful HPC systems are heterogeneous, in the sense that they combine general-purpose CPUs with accelerators such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs) [69]. Several HPC approaches [70,71,72,73] have been developed to improve the performance of advanced and data-intensive modeling and simulation applications. The parallel computing paradigm may be used on multi-core CPUs, many-core processing units (such as GPUs [74]), re-configurable hardware platforms (such as FPGAs), or over distributed infrastructure (such as a cluster, Grid, or Cloud).…”
Section: Task Parallelization and High-performance Computingmentioning
confidence: 99%