Abstract: We address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and co…
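The pattern described in the abstract maps naturally onto a queue-driven loop. Below is a minimal scalar sketch, assuming morphological reconstruction as the concrete instance on a flattened grayscale grid with a 4-connected neighborhood; the function name and the naive seeding strategy are illustrative, not the paper's implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <queue>
#include <vector>

// Scalar sketch of the IWPP, instantiated for morphological reconstruction:
// marker values are raised toward the mask, and any pixel whose value
// changes (re)joins the wavefront. The grid is a flattened W x H grayscale
// image with a 4-connected neighborhood.
void iwpp_propagate(std::vector<uint8_t>& marker,
                    const std::vector<uint8_t>& mask,
                    int W, int H) {
    std::queue<int> wavefront;
    for (int i = 0; i < W * H; ++i) wavefront.push(i);  // naive seeding

    const int dx[4] = {-1, 1, 0, 0};
    const int dy[4] = { 0, 0, -1, 1};

    while (!wavefront.empty()) {
        int p = wavefront.front(); wavefront.pop();
        int px = p % W, py = p / W;
        for (int k = 0; k < 4; ++k) {
            int nx = px + dx[k], ny = py + dy[k];
            if (nx < 0 || nx >= W || ny < 0 || ny >= H) continue;
            int q = ny * W + nx;
            // Propagation condition: the wave can raise the neighbor's
            // marker, capped by the mask; a changed element becomes part
            // of the wavefront, which is what makes the accesses irregular.
            uint8_t v = std::min(marker[p], mask[q]);
            if (v > marker[q]) {
                marker[q] = v;
                wavefront.push(q);
            }
        }
    }
}
```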
“…In order to evaluate the performance of the proposed methods, we have compared the implementations to efficient CPU 3 and GPU 1,18 implementations of the IWPP. We have also benchmarked the processors using the STREAM benchmark 19 to measure regular memory access bandwidth. The memory bandwidths attained with these benchmarks are presented in Table 3.…”
Section: Experimental Evaluation
“…1 The multi-core CPU version was executed on 2.6 GHz Intel Xeon E5 processors and employed all 16 available computing cores. The SE10P and 7120P devices are each equipped with 61 cores, but the latter has a higher clock rate.…”
Section: Experimental Evaluation
“…1 Some core operations in image analysis applications share a common computation structure called the Irregular Wavefront Propagation Pattern (IWPP). 1 For instance, segmentation operations that use the IWPP include Morphological Reconstruction, 3 Fill Holes, 4 H-minima/maxima, 4 Watershed, 5 and Distance Transform. 6 …”
Summary
The Irregular Wavefront Propagation Pattern (IWPP) is a core computing structure in several image analysis operations. Efficient implementation of the IWPP on the Intel Xeon Phi is difficult because of its irregular data access and computation characteristics. The traditional IWPP algorithm relies on atomic instructions, which are not available in the SIMD instruction set of the Intel Phi. To overcome this limitation, we have proposed a new IWPP algorithm that can take advantage of the non-atomic SIMD instructions supported on the Intel Xeon Phi. We have also developed and evaluated methods to use the CPU and Intel Phi cooperatively for parallel execution of the IWPP algorithms. Our new cooperative IWPP version is also able to handle large out-of-core images that would not fit into the memory of the accelerator. The new IWPP algorithm is used to implement the Morphological Reconstruction and Fill Holes operations, both commonly found in image analysis applications. The vectorization implemented with the new IWPP has attained improvements of up to about 5× over the original IWPP and significant gains over state-of-the-art CPU and GPU versions. The new version running on an Intel Phi is 6.21× and 3.14× faster than running on a 16-core CPU and on a GPU, respectively. Finally, cooperative execution using two Intel Phi devices and a multi-core CPU has reached performance gains of 2.14× as compared to execution using a single Intel Xeon Phi.
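The summary does not spell out why dropping atomics is safe, so the following is a hedged reconstruction: morphological reconstruction is monotonic (marker values only grow toward the mask), so a max/min-based update lost to a race can simply be re-applied in a later pass. A plain C++ sketch of such a race-tolerant, vectorizable row update follows; the function, pragma, and data layout are illustrative, not the authors' Xeon Phi intrinsics code.

```cpp
#include <algorithm>
#include <cstdint>

// Race-tolerant propagation step: marker values increase monotonically
// toward the mask, so max/min-based updates need no atomic instructions --
// an update lost to a race is re-applied in a later pass and the result
// still converges. The loop has no loop-carried dependence and can be
// vectorized (here via OpenMP SIMD) on hardware whose vector ISA, like
// the Xeon Phi's, offers no SIMD atomics.
bool propagate_row(uint8_t* marker_row, const uint8_t* marker_above,
                   const uint8_t* mask_row, int W) {
    int changed = 0;
    #pragma omp simd reduction(|:changed)
    for (int x = 0; x < W; ++x) {
        uint8_t v = std::min(marker_above[x], mask_row[x]);
        if (v > marker_row[x]) { marker_row[x] = v; changed |= 1; }
    }
    return changed != 0;  // caller keeps sweeping until no row changes
}
```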
“…By stepping along the incremental direction of f and processing all elements associated with each step, data dependencies can be respected. So far, all existing implementations of wavefront applications on GPUs adopt this data-parallel pattern [22,31,32]. Figure 6 illustrates the processing trace of this pattern for the Needleman-Wunsch algorithm.…”
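For Needleman-Wunsch the "incremental direction of f" is the anti-diagonal index d = i + j: cell (i, j) depends only on cells on diagonals d-1 and d-2, so all cells on one diagonal are independent. A self-contained sketch of this data-parallel fill (scoring parameters and names are illustrative):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Data-parallel wavefront for Needleman-Wunsch: cell (i, j) depends on
// (i-1, j), (i, j-1), and (i-1, j-1), so every cell on the anti-diagonal
// d = i + j is independent. Diagonals run in order; cells within one
// diagonal could each be handled by a separate GPU thread.
void nw_fill(std::vector<int>& H, const std::string& a, const std::string& b,
             int match, int mismatch, int gap) {  // gap is typically negative
    int n = (int)a.size() + 1, m = (int)b.size() + 1;
    H.assign(n * m, 0);
    for (int i = 0; i < n; ++i) H[i * m] = i * gap;  // first column
    for (int j = 0; j < m; ++j) H[j] = j * gap;      // first row
    for (int d = 2; d <= (n - 1) + (m - 1); ++d) {   // one diagonal at a time
        int lo = std::max(1, d - (m - 1)), hi = std::min(n - 1, d - 1);
        for (int i = lo; i <= hi; ++i) {             // independent: parallel-safe
            int j = d - i;
            int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
            H[i * m + j] = std::max({H[(i - 1) * m + (j - 1)] + s,
                                     H[(i - 1) * m + j] + gap,
                                     H[i * m + (j - 1)] + gap});
        }
    }
}
```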
The last decade has witnessed the emergence of many-core platforms, especially graphics processing units (GPUs). With the exponential growth in the number of cores in GPUs, utilizing them efficiently becomes a challenge. The data-parallel programming model assumes a single instruction stream for multiple concurrent threads (SIMT); therefore, little support is offered to enforce thread ordering and fine-grained synchronization. This becomes an obstacle when migrating algorithms that exploit fine-grained parallelism, such as dataflow algorithms, to GPUs. In this paper, we propose a novel approach for fine-grained inter-thread synchronization on the shared memory of modern GPUs. We demonstrate its performance and compare it with other fine-grained and medium-grained synchronization approaches. Our method achieves 1.5x speedup over the warp-barrier based approach and 4.0x speedup over the atomic spin-lock based approach on average. To further explore the possibility of realizing fine-grained dataflow algorithms on GPUs, we apply the proposed synchronization scheme to Needleman-Wunsch, a 2D wavefront application involving massive cross-loop data dependencies. Our implementation achieves 3.56x speedup over the atomic spin-lock implementation and 1.15x speedup over the conventional data-parallel implementation for a basic sub-grid, which implies that the fine-grained, lock-based programming pattern could be an alternative choice for designing general-purpose GPU (GPGPU) applications.
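A host-side analogue makes the synchronization idea concrete: each cell carries a ready flag, and a consumer spins only on the flags of its own dependencies rather than on a barrier over all threads. This sketch uses std::atomic on the CPU purely for illustration; the paper's scheme lives in GPU shared memory and is not reproduced here.

```cpp
#include <atomic>

// Host-side analogue of fine-grained dataflow synchronization: each cell
// carries a ready flag, and a consumer spins only on the cells it depends
// on instead of taking a coarse barrier across all threads. The cited work
// places such flags in GPU shared memory; std::atomic stands in here.
struct Cell {
    std::atomic<bool> ready{false};
    int value{0};
};

void produce(Cell& c, int v) {
    c.value = v;                                     // write the payload
    c.ready.store(true, std::memory_order_release);  // then publish it
}

int consume(const Cell& c) {
    while (!c.ready.load(std::memory_order_acquire)) { /* spin */ }
    return c.value;
}
```

In a wavefront computation, the thread computing cell (i, j) would call consume on cells (i-1, j), (i, j-1), and (i-1, j-1) before producing its own value.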
“…To port these algorithms to the GPU, we have implemented a hierarchical and scalable queue to store elements (pixels) in fast GPU memories along with several optimizations to reduce execution time. We refer the reader to the following manuscripts [8], [9] for implementation details. The queue-based implementation resulted in significant performance improvements over previously published GPU-enabled versions of the MR algorithm [10].…”
Section: Application Parallelization for High Throughput Execution
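The quoted passage gives no details of the queue itself, but the two-level idea can be sketched as follows: a small fixed-capacity local buffer (GPU shared memory in [8], [9]; a plain array here) absorbs most pushes and spills to a larger global store on overflow. All names and the LIFO policy are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Two-level work queue in the spirit of the hierarchical GPU queue quoted
// above: a small fixed-capacity local buffer (GPU shared memory in the
// cited work; a plain array here) absorbs most pushes and spills to a
// larger global store only on overflow.
template <std::size_t LocalCap = 256>
class HierarchicalQueue {
    int local_[LocalCap];
    std::size_t nlocal_ = 0;
    std::vector<int> global_;  // slow, effectively unbounded backing store
public:
    void push(int pixel) {
        if (nlocal_ < LocalCap) local_[nlocal_++] = pixel;  // fast path
        else global_.push_back(pixel);                      // spill
    }
    bool pop(int& pixel) {
        if (nlocal_ > 0) { pixel = local_[--nlocal_]; return true; }
        if (!global_.empty()) { pixel = global_.back(); global_.pop_back(); return true; }
        return false;
    }
};
```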
Analysis of large pathology image datasets offers significant opportunities for the investigation of disease morphology, but the resource requirements of analysis pipelines limit the scale of such studies. Motivated by a brain cancer study, we propose and evaluate a parallel image analysis application pipeline for high-throughput computation of large datasets of high-resolution pathology tissue images on distributed CPU-GPU platforms. To achieve efficient execution on these hybrid systems, we have built runtime support that allows us to express the cancer image analysis application as a hierarchical data processing pipeline. The application is implemented as a coarse-grain pipeline of stages, where each stage may be further partitioned into another pipeline of fine-grain operations. The fine-grain operations are efficiently managed and scheduled for computation on CPUs and GPUs using performance-aware scheduling techniques along with several optimizations, including architecture-aware process placement, data locality conscious task assignment, data prefetching, and asynchronous data copy. These optimizations are employed to maximize the utilization of the aggregate computing power of CPUs and GPUs and minimize data copy overheads. Our experimental evaluation shows that the cooperative use of CPUs and GPUs achieves significant improvements on top of GPU-only versions (up to 1.6×) and that the execution of the application as a set of fine-grain operations provides more opportunities for runtime optimizations and attains better performance than the coarser-grain, monolithic implementations used in other works. An implementation of the cancer image analysis pipeline using the runtime support was able to process an image dataset consisting of 36,848 4K×4K-pixel image tiles (about 1.8 TB uncompressed) in less than 4 minutes (150 tiles/second) on 100 nodes of a state-of-the-art hybrid cluster system.
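As a structural illustration of the hierarchical pipeline described above, the toy model below represents a coarse-grain stage as a list of fine-grain operations, each tagged with profiled CPU and GPU costs so a scheduler can make a performance-aware device choice. The types and the greedy policy are assumptions for exposition, not the cited runtime's API.

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy model of the hierarchical pipeline: a coarse-grain stage is a list
// of fine-grain operations, each carrying profiled CPU and GPU costs so a
// performance-aware scheduler can pick the faster device per operation.
struct Operation {
    std::string name;
    double cpu_cost, gpu_cost;             // measured execution times (s)
    std::function<void(bool on_gpu)> run;  // dispatch to the chosen device
};

struct Stage {
    std::string name;
    std::vector<Operation> ops;  // the fine-grain inner pipeline
};

// Greedy placement; the real runtime also handles locality, prefetching,
// and asynchronous copies, which this sketch deliberately omits.
void schedule(const std::vector<Stage>& pipeline) {
    for (const auto& stage : pipeline)
        for (const auto& op : stage.ops)
            op.run(op.gpu_cost < op.cpu_cost);
}
```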