Patrick McCormick scite author profile

The first part of this paper surveys co-processor approaches for commodity-based clusters in general, not only with respect to raw performance, but also in view of their system integration and power consumption. We then extend previous work on a small GPU cluster by exploring the heterogeneous hardware approach for a large-scale system with up to 160 nodes. Starting with a conventional commodity-based cluster, we leverage the high bandwidth of graphics processing units (GPUs) to increase the overall system bandwidth, which is the decisive performance factor in this scenario. Thus, even the addition of low-end, out-of-date GPUs leads to improvements in both performance- and power-related metrics.


Using GPUs to improve multigrid solver performance on a cluster

Göddeke

Strzodka

Mohd-Yusof

et al. 2008

IJCSE


This article explores the coupling of coarse- and fine-grained parallelism for Finite Element simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. We demonstrate the viability of our approach by using commodity graphics processors (GPUs), chosen for their excellent price-performance ratio, as efficient multigrid preconditioners. We address the issue of limited precision on GPUs by applying a mixed-precision iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU. From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer they are easily interchangeable. Our results show that we do not compromise any software functionality and gain speedups of a factor of two or more for large problems. Equipped with this additional option of hardware acceleration, we compare different ways of increasing the performance of a conventional, commodity-based cluster: increasing the number of nodes, replacing nodes with a newer technology generation, and adding powerful graphics cards to the existing nodes.
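The mixed-precision iterative refinement idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract describes GPU multigrid solvers, whereas the sketch below swaps in a small Gaussian elimination whose arithmetic is rounded to single precision (emulated with Python's `struct` module) as a stand-in for the low-precision solver. All function names (`to_f32`, `solve_low_precision`, `refine`) and the 8x8 test system are hypothetical, chosen only for illustration.

```python
import struct


def to_f32(x):
    """Round a double-precision Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matvec(A, x):
    """Dense matrix-vector product in double precision."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def solve_low_precision(A, b):
    """Gaussian elimination with partial pivoting, rounding every intermediate
    result to single precision. Stand-in (assumption) for a low-precision solver."""
    n = len(b)
    M = [[to_f32(v) for v in row] + [to_f32(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = to_f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = to_f32(M[i][j] - to_f32(f * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = to_f32(s - to_f32(M[i][j] * x[j]))
        x[i] = to_f32(s / M[i][i])
    return x


def refine(A, b, tol=1e-12, max_iter=20):
    """Iterative refinement: residual and update in double precision,
    correction solve in (emulated) single precision."""
    x = [0.0] * len(b)
    for _ in range(max_iter):
        r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        if max(abs(ri) for ri in r) <= tol * max(abs(bi) for bi in b):
            break
        c = solve_low_precision(A, r)  # cheap, inexact correction
        x = [xi + ci for xi, ci in zip(x, c)]
    return x


# Hypothetical test problem: 1D Poisson matrix (2 on the diagonal, -1 beside it).
n = 8
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n
x = refine(A, b)
```

The design point is that only the residual and the solution update need high precision; the expensive inner solve can run at whatever precision the accelerator offers, and the outer loop recovers the lost digits as long as the system is reasonably well conditioned.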


Interactive texture-based volume rendering for large data sets

Kniss

McCormick

McPherson

et al. 2001

IEEE Comput. Grap. Appl.


Visualization is an integral part of scientific computation and simulation. State-of-the-art simulations of physical systems can generate terabytes to petabytes of time-varying data where a single time step can contain more than a gigabyte of data per variable. As memory sizes continue to increase, the size of data sets will likely increase at a comparably high rate. The key to understanding this data is visualizing the global and local relationships of data elements. Direct volume rendering is an excellent method for examining these properties. It lets each data element contribute to the final image and allows querying of the spatial relationship of data elements and their quantitative relationships. Hardware-accelerated volume rendering lets users achieve interactive display rates for reasonably sized data sets. The size of interactive data sets is a function of the hardware's available texture memory and fill rate. Current high-end hardware implementations place an upper bound on data-set sizes at approximately 256 Mbytes. In this article, we present a scalable, pipelined approach for rendering data sets too large for a single graphics card. To do so, we take advantage of multiple hardware rendering units and parallel software compositing. (See the "Previous Work" sidebar on p. 54 for other approaches.) The goals of TRex, our system for interactive volume rendering of large data sets, are to provide near-interactive display rates for time-varying, terabyte-sized uniformly sampled data sets and provide a low-latency platform for volume visualization in immersive environments. We consider 5 frames per second (fps) to be near-interactive rates for normal viewing environments and immersive environments to have a lower bound frame rate of 10 fps. Although this is significantly below most virtual environment update rates, we've found that the user can successfully investigate extremely large data sets at this rate.
Using TRex for virtual reality environments requires low latency, around 50 ms per frame or 100 ms per view update or stereo pair. To achieve lower latency renderings, we either render smaller portions of the volume on more graphics pipes or subsample the volume to render fewer samples per frame by each graphics pipe. Unstructured data sets must be resampled to appropriately leverage the 3D texture volume rendering method. Preprocessing: Our implementation requires an offline preprocessing step in which the data is quantized from its native


Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance

Slaughter

et al. 2020


Co-processor acceleration of an unmodified parallel solid mechanics code with FEASTGPU

Göddeke

Wobker

Strzodka

et al. 2009

IJCSE


Feast is a hardware-oriented, MPI-based Finite Element solver toolkit. With the extension FeastGPU, the authors have previously demonstrated that significant speed-ups in the solution of the scalar Poisson problem can be achieved by adding GPUs as scientific co-processors to a commodity-based cluster. In this paper we put the more general claim to the test: applications based on Feast that have so far run only on CPUs can be successfully accelerated on a co-processor-enhanced cluster without any code modifications. The chosen solid mechanics code has higher accuracy requirements and a more diverse CPU/co-processor interaction than the Poisson example, and is thus better suited to assess the practicability of our acceleration approach. We present accuracy experiments, a scalability test and acceleration results for different elastic objects under load. In particular, we demonstrate in detail that the single precision execution of the co-processor does not affect the final accuracy. We establish how the local acceleration gains of factors 5.5 to 9.0 translate into 1.6- to 2.6-fold total speed-up. Subsequent analysis reveals which measures will increase these factors further.
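The gap between local and total speed-up quoted in this abstract is the kind of relationship Amdahl's law describes. The sketch below is an illustration only, not the paper's own analysis: it assumes the local factor 5.5 pairs with the total factor 1.6 and 9.0 with 2.6 (the abstract does not state this pairing), and the function names are hypothetical. Under that assumption, roughly 46% to 69% of the original runtime would be spent in the accelerated part.

```python
def total_speedup(fraction, local_speedup):
    """Amdahl's law: overall speed-up when `fraction` of the original runtime
    is accelerated by a factor of `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def accelerated_fraction(local_speedup, observed_total):
    """Invert Amdahl's law: which runtime fraction must have been accelerated
    to turn `local_speedup` into `observed_total`?"""
    return (1.0 - 1.0 / observed_total) / (1.0 - 1.0 / local_speedup)


# Assumed pairing of the figures quoted in the abstract (not confirmed by the source).
low = accelerated_fraction(5.5, 1.6)
high = accelerated_fraction(9.0, 2.6)
```

Read this way, the unaccelerated remainder of the runtime, rather than the co-processor itself, limits the total gain, which is consistent with the abstract's remark that further measures could increase the factors.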

