Barra: A Parallel Functional Simulator for GPGPU

Collange, Sylvain; Daumas, Marc; Defour, David; Parello, David

doi:10.1109/mascots.2010.43

Cited by 84 publications

(46 citation statements)

References 19 publications


“…We develop a memory transaction simulator to compute the number of transactions at the hardware level. We use the functional simulator Barra [6] to generate the dynamic program execution information on how many times each instruction is executed. Then we use this information to generate the number of dynamic instructions of each type, the number of shared memory transactions, the number of global memory transactions, and the number of stages divided by synchronization barriers.…”

Section: Performance Modeling and Analysis Methodology (mentioning)

confidence: 99%
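The aggregation step described in the statement above, turning per-instruction execution counts into dynamic counts per instruction type, can be sketched roughly as follows. This is only an illustration: the instruction table, the execution-count mapping, the type labels, and the function name are all hypothetical and do not reflect Barra's actual output format or the citing authors' tool.

```python
from collections import Counter

def dynamic_counts_by_type(static_instructions, exec_counts):
    """Sum per-instruction execution counts into per-type dynamic counts.

    static_instructions: maps an instruction address to a type label
        (hypothetical labels such as "alu" or "shared_load").
    exec_counts: maps an instruction address to the number of times a
        functional simulator reported that instruction as executed.
    """
    totals = Counter()
    for address, kind in static_instructions.items():
        # Instructions that never executed contribute zero.
        totals[kind] += exec_counts.get(address, 0)
    return totals

# Hypothetical kernel with four static instructions.
static_instructions = {0: "alu", 8: "shared_load", 16: "alu", 24: "global_load"}
exec_counts = {0: 10, 8: 10, 16: 5}

totals = dynamic_counts_by_type(static_instructions, exec_counts)
```

Keeping the static instruction table separate from the dynamic counts mirrors the split in the quoted text between what the program contains and how often each instruction actually runs.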

“…Since the instruction set of native machine code is not publicly documented, we use the disassembler Decuda developed by van der Laan [16], on which Barra [6] is based as well. With the assistance of Decuda, we build a tool to modify the original binary instructions, assemble the modified instructions back to the binary code sequence, and finally embed the modified code into the execution file.…”

Section: Performance Modeling (mentioning)

confidence: 99%

“…For example, if 3 threads read from different locations in the same bank, there would be 3 memory transactions, instead of 1 in the case they read from different banks. Since the functional simulator Barra [6] does not collect bank conflicts information, we wrote an automated program to derive the effective number of shared memory transactions by specifying the degree of bank conflicts of each shared memory access.…”

Section: Shared Memory (mentioning)

confidence: 99%
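The bank-conflict arithmetic in the statement above (3 transactions when 3 threads hit different locations in one bank, 1 when they hit different banks) can be illustrated with a small Python sketch. It rests on assumptions not stated in the quoted text: banks are interleaved by word so that the bank index is the word address modulo the bank count, the default of 16 banks is a parameter chosen for illustration, and threads reading the same word are treated as one access. The function name is hypothetical.

```python
from collections import defaultdict

def shared_memory_transactions(word_addresses, num_banks=16):
    """Estimate serialized shared memory transactions for one access.

    word_addresses: word-granularity addresses, one per participating thread.
    num_banks: number of banks (assumed value, word-interleaved mapping).
    Returns the degree of bank conflict: the largest number of distinct
    words that fall into any single bank.
    """
    words_per_bank = defaultdict(set)
    for address in word_addresses:
        # Assumed mapping: consecutive words live in consecutive banks.
        words_per_bank[address % num_banks].add(address)
    if not words_per_bank:
        return 0
    return max(len(words) for words in words_per_bank.values())

same_bank = shared_memory_transactions([0, 16, 32])    # all map to bank 0
different_banks = shared_memory_transactions([0, 1, 2])  # banks 0, 1, 2
```

Under these assumptions, `same_bank` is 3 and `different_banks` is 1, matching the example in the quoted statement.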

“…However, compared to an enormous amount of efforts devoted to application development, little has been done on supporting tools for performance profiling and analysis. Commercial program profiling tools such as ATI Stream Profiler [4] and NVIDIA Parallel Nsight [5], along with academic GPU functional simulators [6,7], are limited to providing program statistics only, but do not relate these statistics to program performance. Therefore, the hard work of identifying program bottlenecks and estimating the benefits of potential optimizations is done by programmers' paper-and-pencil analysis.…”

Section: Introduction (mentioning)

confidence: 99%

“…Third, our model is based on a native GPU instruction set instead of the intermediate PTX [10] assembly language or a high-level language. Simulating only the PTX instruction set leads to poor accuracy, because PTX code is not run directly on GPU hardware but instead is further compiled to native machine instructions where significant compiler optimizations are applied [6]. Fourth, these two studies are mainly based on static program statistics, while ours is based on dynamic program statistics collected from the Barra simulator, which enables us to handle data-dependent applications.…”

Section: Introduction (mentioning)

confidence: 99%


A quantitative performance analysis model for GPU architectures

Zhang, Owens. 2011. 2011 IEEE 17th International Symposium on High Performance Computer Architecture.


We develop a microbenchmark-based performance model for NVIDIA GeForce 200-series GPUs. Our model identifies GPU program bottlenecks and quantitatively analyzes performance, and thus allows programmers and architects to predict the benefits of potential program optimizations and architectural improvements. In particular, we use a microbenchmark-based approach to develop a throughput model for three major components of GPU execution time: the instruction pipeline, shared memory access, and global memory access. Because our model is based on the GPU's native instruction set, we can predict performance with a 5-15% error. To demonstrate the usefulness of the model, we analyze three representative real-world and already highly-optimized programs: dense matrix multiply, tridiagonal systems solver, and sparse matrix vector multiply. The model provides us detailed quantitative analysis on performance, allowing us to understand the configuration of the fastest dense matrix multiply implementation and to optimize the tridiagonal solver and sparse matrix vector multiply by 60% and 18% respectively. Furthermore, our model applied to analysis on these codes allows us to suggest architectural improvements on hardware resource allocation, avoiding bank conflicts, block scheduling, and memory transaction granularity.

Faithful performance prediction of a dynamic task‐based runtime system for heterogeneous multi‐core architectures

Stanisic, Thibault, Legrand, et al. 2015. Concurrency and Computation.


SUMMARY: Multi-core architectures comprising several graphics processing units (GPUs) have become mainstream in the field of high-performance computing. However, obtaining the maximum performance of such heterogeneous machines is challenging as it requires to carefully off-load computations and manage data movements between the different processing units. The most promising and successful approaches so far build on task-based runtimes that abstract the machine and rely on opportunistic scheduling algorithms. As a consequence, the problem gets shifted to choosing the task granularity, task graph structure, and optimizing the scheduling strategies. Trying different combinations of these different alternatives is also itself a challenge. Indeed, obtaining accurate measurements requires reserving the target system for the whole duration of experiments. Furthermore, observations are limited to the few available systems at hand and may be difficult to generalize. In this article, we show how we crafted a coarse-grain hybrid simulation/emulation of StarPU, a dynamic runtime for hybrid architectures, over SimGrid, a versatile simulator of distributed systems. This approach allows to obtain performance predictions of classical dense linear algebra kernels accurate within a few percents and in a matter of seconds, which allows both runtime and application designers to quickly decide which optimization to enable or whether it is worth investing in higher-end graphics processing units or not. Additionally, it allows to conduct robust and extensive scheduling studies in a controlled environment whose characteristics are very close to real platforms while having reproducible behavior.
