Abstract: We address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and co…
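The pattern described in the abstract maps naturally onto a queue-driven loop. Below is a minimal scalar sketch, assuming morphological reconstruction as the concrete instance on a flattened grayscale grid with a 4-connected neighborhood; the function name and the naive seeding strategy are illustrative, not the paper's implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <queue>
#include <vector>

// Scalar sketch of the IWPP, instantiated for morphological reconstruction:
// marker values are raised toward the mask, and any pixel whose value
// changes (re)joins the wavefront. The grid is a flattened W x H grayscale
// image with a 4-connected neighborhood.
void iwpp_propagate(std::vector<uint8_t>& marker,
                    const std::vector<uint8_t>& mask,
                    int W, int H) {
    std::queue<int> wavefront;
    for (int i = 0; i < W * H; ++i) wavefront.push(i);  // naive seeding

    const int dx[4] = {-1, 1, 0, 0};
    const int dy[4] = { 0, 0, -1, 1};

    while (!wavefront.empty()) {
        int p = wavefront.front(); wavefront.pop();
        int px = p % W, py = p / W;
        for (int k = 0; k < 4; ++k) {
            int nx = px + dx[k], ny = py + dy[k];
            if (nx < 0 || nx >= W || ny < 0 || ny >= H) continue;
            int q = ny * W + nx;
            // Propagation condition: the wave can raise the neighbor's
            // marker, capped by the mask; a changed element becomes part
            // of the wavefront, which is what makes the accesses irregular.
            uint8_t v = std::min(marker[p], mask[q]);
            if (v > marker[q]) {
                marker[q] = v;
                wavefront.push(q);
            }
        }
    }
}
```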
“…In order to evaluate the performance of the proposed methods, we have compared the implementations to efficient CPU 3 and GPU 1,18 implementations of the IWPP. We have also benchmarked the processors using the STREAM benchmark 19 to measure regular memory access bandwidth. The memory bandwidths attained with these benchmarks are presented in Table 3.…”
Section: Experimental Evaluation
“…1 The multi-core CPU version was executed on 2.6 GHz Intel Xeon E5 processors and employed all 16 available computing cores. The SE10P and 7120P devices are each equipped with 61 cores, but the latter has a higher clock rate.…”
Section: Experimental Evaluation
“…1 Some core operations in image analysis applications share a common computation structure called the Irregular Wavefront Propagation Pattern (IWPP). 1 For instance, segmentation operations that use the IWPP include Morphological Reconstruction, 3 Fill Holes, 4 H-minima/maxima, 4 Watershed, 5 and Distance Transform. 6 …”
Summary
The Irregular Wavefront Propagation Pattern (IWPP) is a core computing structure in several image analysis operations. Efficient implementation of the IWPP on the Intel Xeon Phi is difficult because of its irregular data access and computation characteristics. The traditional IWPP algorithm relies on atomic instructions, which are not available in the SIMD instruction set of the Intel Phi. To overcome this limitation, we have proposed a new IWPP algorithm that can take advantage of the non-atomic SIMD instructions supported on the Intel Xeon Phi. We have also developed and evaluated methods to use the CPU and Intel Phi cooperatively for parallel execution of the IWPP algorithms. Our new cooperative IWPP version is also able to handle large out-of-core images that would not fit into the memory of the accelerator. The new IWPP algorithm is used to implement the Morphological Reconstruction and Fill Holes operations, both commonly found in image analysis applications. The vectorization implemented with the new IWPP has attained improvements of up to about 5× over the original IWPP and significant gains over state-of-the-art CPU and GPU versions. The new version running on an Intel Phi is 6.21× and 3.14× faster than running on a 16-core CPU and on a GPU, respectively. Finally, cooperative execution using two Intel Phi devices and a multi-core CPU has reached performance gains of 2.14× as compared to execution using a single Intel Xeon Phi.
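The summary does not spell out why dropping atomics is safe, so the following is a hedged reconstruction: morphological reconstruction is monotonic (marker values only grow toward the mask), so a max/min-based update lost to a race can simply be re-applied in a later pass. A plain C++ sketch of such a race-tolerant, vectorizable row update follows; the function, pragma, and data layout are illustrative, not the authors' Xeon Phi intrinsics code.

```cpp
#include <algorithm>
#include <cstdint>

// Race-tolerant propagation step: marker values increase monotonically
// toward the mask, so max/min-based updates need no atomic instructions --
// an update lost to a race is re-applied in a later pass and the result
// still converges. The loop has no loop-carried dependence and can be
// vectorized (here via OpenMP SIMD) on hardware whose vector ISA, like
// the Xeon Phi's, offers no SIMD atomics.
bool propagate_row(uint8_t* marker_row, const uint8_t* marker_above,
                   const uint8_t* mask_row, int W) {
    int changed = 0;
    #pragma omp simd reduction(|:changed)
    for (int x = 0; x < W; ++x) {
        uint8_t v = std::min(marker_above[x], mask_row[x]);
        if (v > marker_row[x]) { marker_row[x] = v; changed |= 1; }
    }
    return changed != 0;  // caller keeps sweeping until no row changes
}
```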
“…By stepping along the incremental direction of f and processing all elements associated with each step, data dependencies can be respected. So far, all existing implementations of wavefront applications on GPUs adopt this data-parallel pattern [22,31,32]. Figure 6 illustrates the processing trace of this pattern for the Needleman-Wunsch algorithm.…”
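For Needleman-Wunsch the "incremental direction of f" is the anti-diagonal index d = i + j: cell (i, j) depends only on cells on diagonals d-1 and d-2, so all cells on one diagonal are independent. A self-contained sketch of this data-parallel fill (scoring parameters and names are illustrative):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Data-parallel wavefront for Needleman-Wunsch: cell (i, j) depends on
// (i-1, j), (i, j-1), and (i-1, j-1), so every cell on the anti-diagonal
// d = i + j is independent. Diagonals run in order; cells within one
// diagonal could each be handled by a separate GPU thread.
void nw_fill(std::vector<int>& H, const std::string& a, const std::string& b,
             int match, int mismatch, int gap) {  // gap is typically negative
    int n = (int)a.size() + 1, m = (int)b.size() + 1;
    H.assign(n * m, 0);
    for (int i = 0; i < n; ++i) H[i * m] = i * gap;  // first column
    for (int j = 0; j < m; ++j) H[j] = j * gap;      // first row
    for (int d = 2; d <= (n - 1) + (m - 1); ++d) {   // one diagonal at a time
        int lo = std::max(1, d - (m - 1)), hi = std::min(n - 1, d - 1);
        for (int i = lo; i <= hi; ++i) {             // independent: parallel-safe
            int j = d - i;
            int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
            H[i * m + j] = std::max({H[(i - 1) * m + (j - 1)] + s,
                                     H[(i - 1) * m + j] + gap,
                                     H[i * m + (j - 1)] + gap});
        }
    }
}
```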
The last decade has witnessed the emergence of many-core platforms, especially graphics processing units (GPUs). With the exponential growth in the number of cores in GPUs, utilizing them efficiently becomes a challenge. The data-parallel programming model assumes a single instruction stream for multiple concurrent threads (SIMT); therefore, little support is offered to enforce thread ordering and fine-grained synchronization. This becomes an obstacle when migrating algorithms that exploit fine-grained parallelism, such as dataflow algorithms, to GPUs. In this paper, we propose a novel approach for fine-grained inter-thread synchronization on the shared memory of modern GPUs. We demonstrate its performance and compare it with other fine-grained and medium-grained synchronization approaches. Our method achieves 1.5x speedup over the warp-barrier based approach and 4.0x speedup over the atomic spin-lock based approach on average. To further explore the possibility of realizing fine-grained dataflow algorithms on GPUs, we apply the proposed synchronization scheme to Needleman-Wunsch, a 2D wavefront application involving massive cross-loop data dependencies. Our implementation achieves 3.56x speedup over the atomic spin-lock implementation and 1.15x speedup over the conventional data-parallel implementation for a basic sub-grid, which implies that the fine-grained, lock-based programming pattern could be an alternative choice for designing general-purpose GPU (GPGPU) applications.
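A host-side analogue makes the synchronization idea concrete: each cell carries a ready flag, and a consumer spins only on the flags of its own dependencies rather than on a barrier over all threads. This sketch uses std::atomic on the CPU purely for illustration; the paper's scheme lives in GPU shared memory and is not reproduced here.

```cpp
#include <atomic>

// Host-side analogue of fine-grained dataflow synchronization: each cell
// carries a ready flag, and a consumer spins only on the cells it depends
// on instead of taking a coarse barrier across all threads. The cited work
// places such flags in GPU shared memory; std::atomic stands in here.
struct Cell {
    std::atomic<bool> ready{false};
    int value{0};
};

void produce(Cell& c, int v) {
    c.value = v;                                     // write the payload
    c.ready.store(true, std::memory_order_release);  // then publish it
}

int consume(const Cell& c) {
    while (!c.ready.load(std::memory_order_acquire)) { /* spin */ }
    return c.value;
}
```

In a wavefront computation, the thread computing cell (i, j) would call consume on cells (i-1, j), (i, j-1), and (i-1, j-1) before producing its own value.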
“…To port these algorithms to the GPU, we have implemented a hierarchical and scalable queue to store elements (pixels) in fast GPU memories along with several optimizations to reduce execution time. We refer the reader to the following manuscripts [8], [9] for implementation details. The queue-based implementation resulted in significant performance improvements over previously published GPU-enabled versions of the MR algorithm [10].…”
Section: Application Parallelization for High Throughput Execution
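The quoted passage gives no details of the queue itself, but the two-level idea can be sketched as follows: a small fixed-capacity local buffer (GPU shared memory in [8], [9]; a plain array here) absorbs most pushes and spills to a larger global store on overflow. All names and the LIFO policy are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Two-level work queue in the spirit of the hierarchical GPU queue quoted
// above: a small fixed-capacity local buffer (GPU shared memory in the
// cited work; a plain array here) absorbs most pushes and spills to a
// larger global store only on overflow.
template <std::size_t LocalCap = 256>
class HierarchicalQueue {
    int local_[LocalCap];
    std::size_t nlocal_ = 0;
    std::vector<int> global_;  // slow, effectively unbounded backing store
public:
    void push(int pixel) {
        if (nlocal_ < LocalCap) local_[nlocal_++] = pixel;  // fast path
        else global_.push_back(pixel);                      // spill
    }
    bool pop(int& pixel) {
        if (nlocal_ > 0) { pixel = local_[--nlocal_]; return true; }
        if (!global_.empty()) { pixel = global_.back(); global_.pop_back(); return true; }
        return false;
    }
};
```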
Analysis of large pathology image datasets offers significant opportunities for the investigation of disease morphology, but the resource requirements of analysis pipelines limit the scale of such studies. Motivated by a brain cancer study, we propose and evaluate a parallel image analysis application pipeline for high-throughput computation of large datasets of high-resolution pathology tissue images on distributed CPU-GPU platforms. To achieve efficient execution on these hybrid systems, we have built runtime support that allows us to express the cancer image analysis application as a hierarchical data processing pipeline. The application is implemented as a coarse-grain pipeline of stages, where each stage may be further partitioned into another pipeline of fine-grain operations. The fine-grain operations are efficiently managed and scheduled for computation on CPUs and GPUs using performance-aware scheduling techniques along with several optimizations, including architecture-aware process placement, data locality conscious task assignment, data prefetching, and asynchronous data copy. These optimizations are employed to maximize the utilization of the aggregate computing power of CPUs and GPUs and minimize data copy overheads. Our experimental evaluation shows that the cooperative use of CPUs and GPUs achieves significant improvements on top of GPU-only versions (up to 1.6×) and that the execution of the application as a set of fine-grain operations provides more opportunities for runtime optimizations and attains better performance than the coarser-grain, monolithic implementations used in other works. An implementation of the cancer image analysis pipeline using the runtime support was able to process an image dataset consisting of 36,848 4K×4K-pixel image tiles (about 1.8 TB uncompressed) in less than 4 minutes (150 tiles/second) on 100 nodes of a state-of-the-art hybrid cluster system.
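As a structural illustration of the hierarchical pipeline described above, the toy model below represents a coarse-grain stage as a list of fine-grain operations, each tagged with profiled CPU and GPU costs so a scheduler can make a performance-aware device choice. The types and the greedy policy are assumptions for exposition, not the cited runtime's API.

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy model of the hierarchical pipeline: a coarse-grain stage is a list
// of fine-grain operations, each carrying profiled CPU and GPU costs so a
// performance-aware scheduler can pick the faster device per operation.
struct Operation {
    std::string name;
    double cpu_cost, gpu_cost;             // measured execution times (s)
    std::function<void(bool on_gpu)> run;  // dispatch to the chosen device
};

struct Stage {
    std::string name;
    std::vector<Operation> ops;  // the fine-grain inner pipeline
};

// Greedy placement; the real runtime also handles locality, prefetching,
// and asynchronous copies, which this sketch deliberately omits.
void schedule(const std::vector<Stage>& pipeline) {
    for (const auto& stage : pipeline)
        for (const auto& op : stage.ops)
            op.run(op.gpu_cost < op.cpu_cost);
}
```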