2012
DOI: 10.1109/TVCG.2011.24

Hybrid Parallelism for Volume Rendering on Large-, Multi-, and Many-Core Systems

Abstract: With the computing industry trending towards multi- and many-core processors, we study how a standard visualization algorithm, ray-casting volume rendering, can benefit from a hybrid parallelism approach. Hybrid parallelism provides the best of both worlds: using distributed-memory parallelism across a large number of nodes increases available FLOPs and memory, while exploiting shared-memory parallelism among the cores within each node ensures that each node performs its portion of the larger calculation…
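
The two levels described in the abstract are commonly realized as MPI across nodes plus OpenMP threads within each node. The sketch below illustrates that pattern under stated assumptions: the per-rank brick, the dummy ray value, and the use of MPI_SUM (a commutative stand-in for depth-ordered over-compositing) are hypothetical simplifications, not the paper's implementation.

```c
/*
 * A minimal sketch of the hybrid pattern, not the paper's code:
 * MPI ranks own sub-volumes (distributed memory), OpenMP threads
 * on each node share the ray-casting work (shared memory).
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define W 512
#define H 512

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank renders its own brick of the volume into a partial image. */
    float *partial = calloc((size_t)W * H, sizeof(float));

    /* Shared-memory level: threads on this node split the rays. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            /* A real renderer would integrate samples along the ray
             * through this rank's brick; a dummy value stands in here. */
            partial[y * W + x] = (float)rank / (float)size;

    /* Distributed-memory level: combine partial images across ranks.
     * MPI_SUM is a commutative stand-in; real volume rendering needs
     * depth-ordered over-compositing, which is not commutative. */
    float *final_img = (rank == 0) ? calloc((size_t)W * H, sizeof(float)) : NULL;
    MPI_Reduce(partial, final_img, W * H, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("composited %dx%d image from %d ranks\n", W, H, size);
        free(final_img);
    }
    free(partial);
    MPI_Finalize();
    return 0;
}
```

Built with mpicc -fopenmp and launched with one MPI rank per node, OpenMP fills each node's cores; this is the configuration a hybrid approach favors over one rank per core, since the threads share their node's copy of the data.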

Cited by 56 publications (44 citation statements). References 19 publications (25 reference statements).

“…Although it is reasonable to ignore the shared memory between the four cores on Intrepid, future computers will have many more cores per node. Some introductory work has analyzed the behavior of image compositing in shared-memory architectures [7,19,21,23], but further refinement is required to take advantage of the hybrid distributed memory plus shared memory architecture of large systems and to evolve the compositing as architectures and rendering algorithms change.…”
Section: Discussion
Citation type: mentioning
Confidence: 99%
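
For the compositing step this statement refers to, the core primitive is the "over" operator applied independently per pixel, which is what makes shared-memory threading attractive: pixels need no synchronization. A minimal sketch follows, assuming premultiplied-alpha fragments already sorted front to back; the layer layout and the tiny main() are illustrative, not from any cited system.

```c
/*
 * A minimal sketch of shared-memory image compositing, assuming
 * premultiplied-alpha fragments already sorted front to back.
 * The layer layout and main() below are illustrative only.
 */
#include <stdio.h>

typedef struct { float r, g, b, a; } rgba;  /* premultiplied alpha */

/* Composite 'front' over 'back'. */
static rgba over(rgba front, rgba back)
{
    rgba out;
    out.r = front.r + (1.0f - front.a) * back.r;
    out.g = front.g + (1.0f - front.a) * back.g;
    out.b = front.b + (1.0f - front.a) * back.b;
    out.a = front.a + (1.0f - front.a) * back.a;
    return out;
}

/* layers[l * npixels + p] holds pixel p of depth-sorted layer l. */
static void composite(const rgba *layers, int nlayers, int npixels, rgba *out)
{
    /* Pixels are independent, so threads need no synchronization. */
    #pragma omp parallel for
    for (int p = 0; p < npixels; ++p) {
        rgba acc = {0.0f, 0.0f, 0.0f, 0.0f};  /* fully transparent */
        for (int l = 0; l < nlayers; ++l)
            acc = over(acc, layers[l * npixels + p]);
        out[p] = acc;
    }
}

int main(void)
{
    /* Two depth-sorted layers for a single pixel. */
    rgba layers[2] = { {0.2f, 0.0f, 0.0f, 0.4f},    /* front */
                       {0.0f, 0.5f, 0.0f, 1.0f} };  /* back  */
    rgba px;
    composite(layers, 2, 1, &px);
    printf("r=%.2f g=%.2f b=%.2f a=%.2f\n", px.r, px.g, px.b, px.a);
    return 0;
}
```
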
“…Our results suggest a wide variation in performance can result: up to 254% on multi-core CPUs and 265% on many-core GPUs. We used the findings of this study to set tunable algorithmic parameters for a set of extreme-concurrency runs [18,19] that required literally millions of CPU hours; by finding and using optimal settings for tunable algorithmic parameters, we in effect saved millions of additional CPU hours that would have been spent executing an application in a non-optimal configuration. This work, which uses a well-established methodology for finding optimal performance, shows that such a methodology can be useful for visualization algorithms as well, and that the algorithmic parameters that produce the best performance vary from problem to problem and platform to platform, often in a non-obvious way.…”
Section: Discussion
Citation type: mentioning
Confidence: 99%
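
The methodology the statement alludes to (sweep the tunable parameters on a small run, then lock in the best setting for production) can be sketched as follows; render_pass() and the candidate block sizes are hypothetical stand-ins for a real rendering kernel and its work-decomposition parameter, not from the cited study.

```c
/*
 * A minimal sketch of tuning one algorithmic parameter by exhaustive
 * sweep. render_pass() and the candidate block sizes are hypothetical
 * stand-ins for a real rendering kernel and its work-decomposition knob.
 */
#include <stdio.h>
#include <time.h>

/* Hypothetical workload whose runtime depends on a block-size parameter. */
static double render_pass(int block)
{
    volatile double acc = 0.0;
    for (int i = 0; i < (1 << 22); ++i)
        acc += (double)(i % block) * 1e-9;
    return acc;
}

int main(void)
{
    const int candidates[] = {8, 16, 32, 64, 128};
    const int n = (int)(sizeof candidates / sizeof candidates[0]);
    int best = candidates[0];
    double best_time = 1e30;

    for (int i = 0; i < n; ++i) {
        clock_t t0 = clock();
        render_pass(candidates[i]);
        double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("block=%3d  %.3f s\n", candidates[i], dt);
        if (dt < best_time) { best_time = dt; best = candidates[i]; }
    }
    printf("best block size: %d (%.3f s)\n", best, best_time);
    return 0;
}
```
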
“…Another class of large-scale volume rendering systems is purely CPU-based, in order to avoid GPU memory limitations altogether. Much research has been devoted to volume rendering on large supercomputers [5,16,26]. This is especially useful in the context of in-situ visualization of large-scale simulations, where the visualization is computed on the same machine as the data, avoiding the need to move large data.…”
Section: Related Work
Citation type: mentioning
Confidence: 99%