Non-rigid Registration for Large Sets of Microscopic Images on Graphics Processors

Ruiz, Alberto; Ujaldón, Manuel; Cooper, Lee; Huang, Kun

doi:10.1007/s11265-008-0208-4

Cited by 31 publications

(22 citation statements)

References 29 publications

“…The process of registration is the focus of our high performance computing effort in this paper, which extends our previous work on a single-processor platform [2] to make use of massive parallelism. For a mouse mammary sample composed of 500 slides, it took more than 181 hours for our C++ code to accomplish the registration process on a high-end CPU.…”

Section: Introduction (mentioning)

confidence: 81%

Parallel Automatic Registration of Large Scale Microscopic Images on Multiprocessor CPUs and GPUs

Cooper, Huang, Ujaldón (2011)

2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PHD Forum

Self Cite

During the present decade, emerging architectures like multicore CPUs and graphics processing units (GPUs) have steadily gained popularity for their ability to deploy high computational power at a low cost. In this paper, we combine parallelization techniques on a cooperative cluster of multicore CPUs and multisocket GPUs to apply their joint computational power to an automatic image registration algorithm intended for the analysis of high-resolution microscope images. Registration methods pose a computational challenge within the biomedical field due to the large size of microscope image data sets, which typically extend to the Terabyte scale. We analyze this application to identify those parts which are more favorable to the CPU and GPU execution models and decompose the process accordingly. Performance results are presented for two sets of images: mouse placenta (16K × 16K pixels) and mouse mammary tumor (23K × 62K pixels). Execution times are shown on different multi-node, multi-socket and multi-core configurations to provide performance insights about the most effective approach.

“…Levin et al [7] implemented a high-performance Thin Plate Spline (TPS) volume warping algorithm that accelerated the application of the TPS nonlinear transformation by combining hardware-accelerated 3D textures, vertex shaders, and trilinear interpolation. Antonio et al [8] used polynomial mapping as non-rigid transformation and achieved a factor of 4.11 speedup with a single GPU and 6.68 with a GPU pair over CPU-based NRR. Vetter et al [9] implemented non-rigid registration on a GPU using mutual information and the Kullback-Leibler divergence and reported GPU performed up to 5 times faster per iteration than the CPU implementation.…”

Section: B. GPU Related Work (mentioning)

confidence: 99%

“…Recently, some groups implemented it on Graphics Processing Units (GPUs) [7], [8], [9], [10], [11]. However, up to now there were no reports on accelerating NRR using the cooperative architecture: multicores and GPU, which is widely available in commodity PCs.…”

Section: Introduction (mentioning)

confidence: 99%

Real-Time Non-rigid Registration of Medical Images on a Cooperative Parallel Architecture

Liu, Fedorov, Kikinis, et al. (2009)

2009 IEEE International Conference on Bioinformatics and Biomedicine

Unacceptable execution time of non-rigid registration (NRR) often presents a major obstacle to its routine clinical use. Parallel computing is an effective way to accelerate NRR. However, development of efficient parallel NRR codes is a very challenging task. One desirable approach is to map the existing sequential algorithm to the parallel architecture to gain speedup instead of designing a new parallel algorithm. Multicores and GPUs provide a cooperative architecture, in which both Single Instruction Multiple Data (SIMD) and Single Program Multiple Data (SPMD) programming models can co-exist and complement each other. We present a method to parallelize an NRR algorithm on this cooperative architecture. Our approach is first to separate the sequential algorithm into regular and irregular parts. We then map the regular part onto the GPU following the SIMD paradigm and the irregular part onto multicores in an SPMD fashion. Unlike the approaches that use multicores or GPU alone, our approach leads to desirable speedup for the whole application by taking advantage of all components of the cooperative parallel architecture, for all individual parts of the application. This helps us to get closer to our goal: cheaper and faster NRR that leads to its more widespread use. The results of our evaluation on clinical brain MRI data show that the GPU-based Block Matching (regular part) can run at least 1.9 times faster than on a typical cluster of workstations with eight high-performance nodes. The multicores-based implementation of the incremental finite element solver (irregular part) achieves a speedup of up to 7 times compared to its sequential version. As a result, the total run time of the NRR code can be reduced to less than 1 minute, therefore satisfying the real-time requirement for its clinical application.

“…For instance, in [11], the authors port "large-scale, biomedical image analysis" applications to multi-core CPUs and GPUs, and compare different implementation strategies with each other. In [21], the authors study image registration and segmentation and accelerate those applications by using CUDA on a GPU. In [24], the authors use both the hardware parallelism and the special function units available on an NVIDIA GPU to dramatically improve the performance of an advanced MRI reconstruction algorithm.…”

Section: Programmable Loop Accelerators (mentioning)

confidence: 99%

Power-efficient medical image processing using PUMA

Dasika, Fan, Mahlke (2009)

2009 IEEE 7th Symposium on Application Specific Processors

Graphics processing units (GPUs) are becoming an increasingly popular platform to run applications that require a high computation throughput. They are limited, however, by memory bandwidth and power and, as such, cannot always achieve their full potential. This paper presents the PUMA architecture, a domain-specific accelerator designed specifically for medical imaging applications, but with sufficient generality to make it programmable. The goal is to closely match the performance achieved by GPUs in this domain but at a fraction of the power consumption. The results are quite promising: PUMA achieves up to 2X the performance of a modern GPU architecture and has up to a 54X improved efficiency on a floating-point and memory-intensive MRI reconstruction algorithm.
