2013
DOI: 10.1016/j.jpdc.2012.09.006
Benchmarking of communication techniques for GPUs

Cited by 16 publications (19 citation statements)
References 8 publications
“…On the same GPU, when the counter pool contains the same number of counters, VATE's PT is significantly lower than VDRE's (VATE's PT is only 0.25% to 25% of VDRE's). On the GTX650-1GB, when the number of counters is 2^28, the PT of VDRE is as high as 1296 milliseconds, and the sum of the three running times is 1447 milliseconds. For an algorithm under a sliding time window, the total running time in each time slice must not exceed the length of the time slice.…”
Section: Methods (mentioning)
confidence: 99%
“…A GPU chip contains hundreds to thousands of processing units, far more than a CPU. For tasks that have no data-access conflicts and that process different data with the same instructions (single instruction, multiple data: SIMD), the GPU can achieve a high speedup [28] [29].…”
Section: Deploy VATE on GPU (mentioning)
confidence: 99%
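
The SIMD pattern these excerpts refer to is easy to see in a toy CUDA kernel: every thread executes the same instruction stream on a different element, and each thread writes only its own slot, so there are no data-access conflicts. The sketch below is a generic illustration of that pattern, not code from the cited papers; the operation (squaring) and all names and sizes are made up.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Generic SIMD-style kernel: every thread runs the same instructions
// on a different element and writes only its own slot, so there are
// no data-access conflicts. The squaring operation is an arbitrary
// stand-in, not code from the cited papers.
__global__ void square(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    square<<<(n + 255) / 256, 256>>>(in, out, n);  // one thread per element
    cudaDeviceSynchronize();

    printf("out[3] = %f\n", out[3]);  // expect 9.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```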
“…The graphics processing unit (GPU) is one of the most popular parallel computing platforms of recent years. For tasks that have no data-access conflicts and that process different data with the same instructions (SIMD), the GPU can achieve a high speedup [2] [19]. Every packet will update the SEAV and the LDCA.…”
Section: Distributed Super Points Detection on GPU (mentioning)
confidence: 99%
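
For the per-packet updates mentioned in this excerpt, a hedged sketch of the general pattern: SEAV and LDCA are structures specific to the citing paper and are not defined here, so a plain hash-indexed counter array stands in for them, with atomicAdd resolving packets that collide on the same slot. All names and sizes are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-packet update: a flat, hash-indexed counter array
// stands in for the cited paper's SEAV and LDCA structures, which are
// not defined in the excerpt. atomicAdd resolves the case where two
// packets hash to the same slot.
__global__ void update_counters(const unsigned* pkt_hash, int n_pkts,
                                unsigned* counters, unsigned n_ctrs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pkts)
        atomicAdd(&counters[pkt_hash[i] % n_ctrs], 1u);  // one update per packet
}

int main() {
    const int n_pkts = 1 << 20;
    const unsigned n_ctrs = 1 << 16;
    unsigned *pkt_hash, *counters;
    cudaMallocManaged(&pkt_hash, n_pkts * sizeof(unsigned));
    cudaMallocManaged(&counters, n_ctrs * sizeof(unsigned));
    for (int i = 0; i < n_pkts; ++i) pkt_hash[i] = 2654435761u * i;  // fake hashes
    cudaMemset(counters, 0, n_ctrs * sizeof(unsigned));

    update_counters<<<(n_pkts + 255) / 256, 256>>>(pkt_hash, n_pkts,
                                                   counters, n_ctrs);
    cudaDeviceSynchronize();
    printf("counters[0] = %u\n", counters[0]);
    cudaFree(pkt_hash);
    cudaFree(counters);
    return 0;
}
```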
“…As far as we know, there are only a few works showing strong-scaling results for spin systems [13,20,21]. We chose to adopt the same technique proposed in [20,21], where the partitioning is performed along the z-axis of the system. All communication among nodes is handled by MPI, and the overlap between calculation and communication is achieved by using CUDA streams.…”
Section: Multi-GPU Implementation (mentioning)
confidence: 99%
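
A minimal sketch of the overlap pattern this excerpt describes, assuming a 1-D slab decomposition along z with periodic neighbours: halo staging and the MPI exchange run on one CUDA stream while the interior update proceeds on another. Kernel bodies, array names, and sizes are placeholders, not the cited implementation.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Placeholder kernels: the real spin-update kernels from the cited
// works are not shown in the excerpt.
__global__ void update_interior(float* f, int plane, int nz) {
    // ... update z-planes 2 .. nz-1, which need no remote data ...
}
__global__ void update_boundary(float* f, int plane, int nz) {
    // ... update z-planes 1 and nz once the halos have arrived ...
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int up = (rank + 1) % size, down = (rank + size - 1) % size;

    const int plane = 256 * 256;          // points per z-plane (assumed)
    const int nz = 64;                    // owned planes per rank
    size_t bytes = (size_t)plane * sizeof(float);

    float* f;                             // layout: [halo | nz planes | halo]
    cudaMalloc(&f, (nz + 2) * bytes);
    float *send_lo, *send_hi, *recv_lo, *recv_hi;
    cudaMallocHost(&send_lo, bytes);      // pinned buffers, so the async
    cudaMallocHost(&send_hi, bytes);      // copies can overlap the kernel
    cudaMallocHost(&recv_lo, bytes);
    cudaMallocHost(&recv_hi, bytes);

    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);

    for (int step = 0; step < 100; ++step) {
        // Stage the two outermost owned planes on the comm stream...
        cudaMemcpyAsync(send_lo, f + plane, bytes,
                        cudaMemcpyDeviceToHost, comm);
        cudaMemcpyAsync(send_hi, f + (size_t)nz * plane, bytes,
                        cudaMemcpyDeviceToHost, comm);
        // ...while the interior update, which needs no remote data, runs
        // concurrently on the compute stream: this is the overlap of
        // calculation and communication the excerpt refers to.
        update_interior<<<256, 256, 0, compute>>>(f, plane, nz);

        // Exchange halos with the two z-neighbours once staging is done.
        cudaStreamSynchronize(comm);
        MPI_Sendrecv(send_lo, plane, MPI_FLOAT, down, 0,
                     recv_hi, plane, MPI_FLOAT, up, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(send_hi, plane, MPI_FLOAT, up, 1,
                     recv_lo, plane, MPI_FLOAT, down, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Push the received halos back and finish the boundary planes.
        cudaMemcpyAsync(f, recv_lo, bytes, cudaMemcpyHostToDevice, comm);
        cudaMemcpyAsync(f + (size_t)(nz + 1) * plane, recv_hi, bytes,
                        cudaMemcpyHostToDevice, comm);
        cudaStreamSynchronize(comm);
        cudaStreamSynchronize(compute);
        update_boundary<<<256, 256, 0, compute>>>(f, plane, nz);
        cudaDeviceSynchronize();          // boundary done before next step
    }

    MPI_Finalize();
    return 0;
}
```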