2008
DOI: 10.1109/ipdps.2008.4536485

Evaluation and tuning of the Level 3 CUBLAS for graphics processors

Abstract: The increase in performance of the last generations of graphics processors (GPUs) …

Cited by 69 publications (48 citation statements)
References 5 publications
“…The CUBLAS library is distributed with CUDA, and it may not be the fastest implementation at a given time, but it gives an optimized performance [3].…”
Section: Algorithm 1 Calculate a Euclidean Distance Matrix With Matri… (mentioning)
confidence: 99%
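
For context, the section cited above computes a Euclidean distance matrix through matrix products. A minimal sketch of that general technique with cuBLAS follows, assuming single precision, column-major storage, and the cublasSgemm interface; the function name, variable names, and dimensions are illustrative and are not taken from the cited paper.

    /* Sketch: the cross term of a squared Euclidean distance matrix,
       D(i,j) = ||x_i||^2 + ||y_j||^2 - 2 * x_i . y_j,
       computed with a single cuBLAS GEMM. Names and sizes are illustrative. */
    #include <cublas_v2.h>

    /* X: n x d points, Y: m x d points, both column-major on the GPU.
       On return, dD holds the n x m cross term -2 * X * Y^T. */
    void cross_term(cublasHandle_t handle, const float *dX, const float *dY,
                    float *dD, int n, int m, int d)
    {
        const float alpha = -2.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                    n, m, d,
                    &alpha, dX, n,
                    dY, m,
                    &beta, dD, n);
        /* The row norms ||x_i||^2 and column norms ||y_j||^2 would then be
           added with a small custom kernel (or further BLAS calls). */
    }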
“…This process is still under investigation. Another method to hide some of the GPU overhead may involve a hybrid technique in which GPU and CPU operations are performed in parallel, such as that described by Barrachina et al [17].…”
Section: E Analysis Of Performance Improvement (mentioning)
confidence: 99%
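
The hybrid technique mentioned above overlaps GPU and CPU work. The sketch below shows one generic way such a split can be arranged for a GEMM, relying on the fact that a cuBLAS call with device data returns control to the host before the GPU finishes; the column split, the host BLAS (cblas_sgemm), and the names are assumptions, not the scheme of Barrachina et al. [17].

    /* Sketch of a generic hybrid split for C = A * B (n x n, column-major):
       the GPU computes the first 'split' columns while the CPU computes the
       rest. The split point, names, and host BLAS are assumptions. */
    #include <cblas.h>
    #include <cublas_v2.h>

    void hybrid_sgemm(cublasHandle_t handle, int n, int split,
                      const float *hA, const float *hB, float *hC,  /* host copies */
                      const float *dA, const float *dB, float *dC)  /* device copies */
    {
        const float one = 1.0f, zero = 0.0f;
        /* Asynchronous with respect to the host: control returns immediately. */
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, split, n,
                    &one, dA, n, dB, n, &zero, dC, n);
        /* Meanwhile the CPU computes the remaining n - split columns. */
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n - split, n, 1.0f, hA, n,
                    hB + (size_t)split * n, n,
                    0.0f, hC + (size_t)split * n, n);
        /* Retrieve the GPU part; this copy waits for the GEMM to complete. */
        cublasGetMatrix(n, split, sizeof(float), dC, n, hC, n);
    }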
“…As we have discussed before, if the matrices are in CPU memory one can use padding, e.g., as in [5]. We have to allocate a bigger dimension of matrix in GPU memory, put zeroes in the extra elements, then transfer the data from CPU to GPU and then call the Kernel.…”
Section: Performance (mentioning)
confidence: 99%
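
A minimal sketch of the padding step described in this quote, assuming column-major single-precision data; the padding multiple of 32, the helper names, and the allocation strategy are illustrative choices, not the values used in [5].

    /* Sketch of the padding scheme from the quote: allocate a padded matrix on
       the GPU, zero it, copy the real m x n data into the top-left corner, and
       later run the kernel on the padded dimensions. The multiple of 32 is an
       assumed block size. */
    #include <cuda_runtime.h>

    static int pad_up(int n, int multiple)
    {
        return ((n + multiple - 1) / multiple) * multiple;
    }

    /* hA: m x n column-major host matrix. Returns a device pointer to an
       mp x np zero-padded copy; *mp and *np receive the padded dimensions. */
    float *upload_padded(const float *hA, int m, int n, int *mp, int *np)
    {
        *mp = pad_up(m, 32);
        *np = pad_up(n, 32);
        size_t bytes = (size_t)(*mp) * (*np) * sizeof(float);
        float *dA = NULL;
        cudaMalloc((void **)&dA, bytes);
        cudaMemset(dA, 0, bytes);                        /* zeroes in the extra elements */
        cudaMemcpy2D(dA, (size_t)(*mp) * sizeof(float),  /* destination pitch in bytes */
                     hA, (size_t)m * sizeof(float),      /* source pitch in bytes */
                     (size_t)m * sizeof(float), n,       /* m floats per column, n columns */
                     cudaMemcpyHostToDevice);
        return dA;
    }

The GEMM would then be called on the padded dimensions, and only the leading m x n block of the result is meaningful and copied back to the host.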