Massively parallel volume rendering using 2-3 swap image compositing

Yu, Hongfeng; Wang, Chaoli; Ma, Kwan-Liu

doi:10.1145/1508044.1508084

Cited by 46 publications

(40 citation statements)

References 15 publications

“…This approach has inefficiencies because processes have to sit idle during most of the computation. The 2-3 swap algorithm [34] takes a different approach. It relaxes binary swap such that processes can be grouped into pairs of two (like binary swap) or sets of three (unlike binary swap).…”

Section: Basic Parallel Compositing Algorithms (mentioning)

confidence: 99%
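
The grouping rule described in the statement above (pairs as in binary swap, or sets of three) can be illustrated with a minimal sketch. It only shows that any process count of at least 2 can be split into groups of size 2 or 3; it is not the multi-round 2-3 swap compositing algorithm itself, and the function name `group_into_pairs_and_triples` and the "one triple if odd" policy are assumptions made for illustration.

```python
def group_into_pairs_and_triples(num_procs):
    """Split ranks 0..num_procs-1 into groups of size 2 or 3.

    Hypothetical helper: uses a single triple when num_procs is odd and
    pairs everywhere else. The actual 2-3 swap grouping may differ.
    """
    if num_procs < 2:
        raise ValueError("need at least 2 processes")
    ranks = list(range(num_procs))
    groups = []
    start = 0
    if num_procs % 2 == 1:
        # An odd count cannot be covered by pairs alone, so form one triple.
        groups.append(ranks[0:3])
        start = 3
    for i in range(start, num_procs, 2):
        # The remaining count is even, so the rest pair up cleanly.
        groups.append(ranks[i:i + 2])
    return groups


if __name__ == "__main__":
    print(group_into_pairs_and_triples(7))
```

With an even process count the result is identical to a binary-swap pairing; the triple only appears when the count is odd, which is the case plain binary swap cannot group without leaving a process unpaired.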

“…These increased demands on sort-last rendering have spawned a resurgence in image compositing research. Recent studies led to the creation of new image compositing algorithms [20,34], and new compositing enhancements [8]. Although each of these studies improve the state of the art in image compositing, all involve locally built algorithm implementations that contain some isolated subset of enhancements.…”

Section: Introduction (mentioning)

confidence: 99%

An image compositing solution at scale

Moreland

Kendall

Peterka

et al. 2011

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

The only proven method for performing distributed-memory parallel rendering at large scales, tens of thousands of nodes, is a class of algorithms called sort last. The fundamental operation of sort-last parallel rendering is an image composite, which combines a collection of images generated independently on each node into a single blended image. Over the years numerous image compositing algorithms have been proposed as well as several enhancements and rendering modes to these core algorithms. However, the testing of these image compositing algorithms has been with an arbitrary set of enhancements, if any are applied at all. In this paper we take a leading production-quality image-compositing framework, IceT, and use it as a testing framework for the leading image compositing algorithms of today. As we scale IceT to ever increasing job sizes, we consider the image compositing systems holistically, incorporate numerous optimizations, and discover several improvements to the process never considered before. We conclude by demonstrating our solution on 64K cores of the Intrepid BlueGene/P at Argonne National Laboratories.

“…However, it will also generate p × (p − 1) messages to be exchanged among all participating processes. In a communication network where each of the participating processes are connected by network links, this will likely generate link contention as multiple processes will simultaneously be sending messages to the same process [3]. The execution time for direct-send is:…”

Section: Related Work (mentioning)

confidence: 99%
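
The p × (p − 1) message count quoted above can be checked with a small sketch. It assumes the usual direct-send pattern in which every process sends one message to every other process; the function name `direct_send_messages` is hypothetical, and the sketch models only who sends to whom, not the execution-time formula that the statement goes on to give.

```python
from collections import Counter


def direct_send_messages(p):
    """List (source, destination) pairs for an all-to-all direct-send.

    Assumption: each of the p processes sends exactly one message to each
    of the other p - 1 processes.
    """
    return [(src, dst) for src in range(p) for dst in range(p) if src != dst]


if __name__ == "__main__":
    p = 4
    messages = direct_send_messages(p)
    incoming = Counter(dst for _, dst in messages)
    print(len(messages))   # total messages, p * (p - 1)
    print(dict(incoming))  # messages arriving at each process
```

Each process receives p − 1 messages in the same exchange step, which is the many-senders-to-one-receiver situation the statement associates with link contention.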

An Investigation into the Performance of Reduction Algorithms under Load Imbalance

Marendi

Lemeire

Haber

et al. 2012

Euro-Par 2012 Parallel Processing

Abstract. Today, most reduction algorithms are optimized for balanced workloads; they assume all processes will start the reduction at about the same time. However, in practice this is not always the case and significant load imbalances may occur and affect the performance of said algorithms. In this paper we investigate the impact of such imbalances on the most commonly employed reduction algorithms and propose a new algorithm specifically adapted to the presented context. Firstly, we analyze the optimistic case where we have a priori knowledge of all imbalances and propose a near-optimal solution. In the general case, where we do not have any foreknowledge of the imbalances, we propose a dynamically rebalanced tree reduction algorithm. We show experimentally that this algorithm performs better than the default OpenMPI and MVAPICH2 implementations.

“…This lack of locality can result in "cache thrashing," which is a relatively low level of cache reuse. They report that object-parallel partitionings scale well, and this form of partitioning has been adopted as the basis for parallel work decomposition in many subsequent works (e.g., [4,35,40,10,20]). In contrast, our work here is a more comprehensive, systematic exploration of the relationship between algorithmic optimization and tunable algorithmic parameters – image tile size, work assignment strategy, and alternative memory layouts for the source data, and algorithmic optimizations – and their impact on algorithm performance in terms of runtime and cache utilization measured via hardware performance counters.…”

Section: Previous Work (mentioning)

confidence: 99%

Multi-core and many-core shared-memory parallel raycasting volume rendering optimization and tuning

Bethel

Howison

2012

The International Journal of High Performance Computing Applica

Given the computing industry trend of increasing processing capacity by adding more cores to a chip, the focus of this work is tuning the performance of a staple visualization algorithm, raycasting volume rendering, for shared-memory parallelism on multi-core CPUs and many-core GPUs. Our approach is to vary tunable algorithmic settings, along with known algorithmic optimizations and two different memory layouts, and measure performance in terms of absolute runtime and L2 memory cache misses. Our results indicate there is a wide variation in runtime performance on all platforms, as much as 254% for the tunable parameters we test on multi-core CPUs and 265% on many-core GPUs, and the optimal configurations vary across platforms, often in a non-obvious way. For example, our results indicate the optimal configurations on the GPU occur at a crossover point between those that maintain good cache utilization and those that saturate computational throughput. This result is likely to be extremely difficult to predict with an empirical performance model for this particular algorithm because it has an unstructured memory access pattern that varies locally for individual rays and globally for the selected viewpoint. Our results also show that optimal parameters on modern architectures are markedly different from those in previous studies run on older architectures. And, given the dramatic performance variation across platforms for both optimal algorithm settings and performance results, there is a clear benefit for production visualization and analysis codes to adopt a strategy for performance optimization through auto-tuning. These benefits will likely become more pronounced in the future as the number of cores per chip and the cost of moving data through the memory hierarchy both increase.
