Proceedings of the 2016 International Conference on Supercomputing (ICS 2016)
DOI: 10.1145/2925426.2926291
Parallel Transposition of Sparse Data Structures

Cited by 37 publications (29 citation statements) · References 42 publications
“…To benchmark atomic operations, we use two kernels that involve atomic operations: an atomic-based SpTRANS method described by Wang et al. (2016) and a synchronization-free SpTRSV algorithm proposed by Liu et al. (2017). The SpTRANS method first uses atomic-add operations to sum the number of nonzeros in each column (assuming both the input and output matrices are in row-major order) and then scatters nonzeros from rows into columns through an atomic-based counter.…”
Section: Sparse Kernels
Mentioning confidence: 99%
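The two-pass atomic scheme this statement describes can be sketched as follows. This is a minimal C++/OpenMP illustration of the idea, not the authors' implementation; the function name sptrans_atomic and the CSR-in/CSC-out layout are assumptions made for the sketch.

    #include <numeric>
    #include <vector>

    // Sketch: transpose an m x n CSR matrix into CSC with the atomic-based
    // scheme described above. Pass 1 counts nonzeros per column via atomic
    // adds, a prefix sum turns the counts into column pointers, and pass 2
    // scatters each nonzero into its column through an atomic write cursor.
    void sptrans_atomic(int m, int n,
                        const std::vector<int>& row_ptr,
                        const std::vector<int>& col_idx,
                        const std::vector<double>& val,
                        std::vector<int>& cptr,      // size n + 1
                        std::vector<int>& ridx,      // size nnz
                        std::vector<double>& cval) { // size nnz
      const int nnz = row_ptr[m];
      cptr.assign(n + 1, 0);
      ridx.resize(nnz);
      cval.resize(nnz);

      // Pass 1: atomic-add the per-column nonzero counts.
      #pragma omp parallel for
      for (int i = 0; i < m; ++i)
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
          #pragma omp atomic
          ++cptr[col_idx[j] + 1];
        }

      // Prefix sum over the shifted counts yields the column pointers.
      std::partial_sum(cptr.begin(), cptr.end(), cptr.begin());

      // Pass 2: scatter rows into columns; each nonzero claims its slot
      // through an atomically captured per-column counter.
      std::vector<int> cur(cptr.begin(), cptr.end() - 1);
      #pragma omp parallel for
      for (int i = 0; i < m; ++i)
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
          const int c = col_idx[j];
          int dst;
          #pragma omp atomic capture
          dst = cur[c]++;
          ridx[dst] = i;
          cval[dst] = val[j];
        }
    }

The atomic counter in pass 2 is what makes the method benchmark-worthy for atomics: contention on cur[] grows with the number of nonzeros sharing a column.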
“…Many studies investigate data-level parallelism on x86-based systems [21,23,36,42]. Correspondingly, several studies have illustrated the benefits of using registers to improve performance on GPUs.…”
Section: Suffix Array Construction
Mentioning confidence: 99%
“…both for traditional HPC applications and for big data processing. In these cases, a large number of independent arrays often need to be sorted as a whole, either because of algorithm characteristics (e.g., suffix array construction in prefix doubling algorithms from bioinformatics [15,44]), dataset properties (e.g., sparse matrices in linear algebra [4, 28–31, 42]), or real-time requests from web users (e.g., queries in data warehouses [45,49,51]). The second trend is that, with the rapidly increasing computational power of new processors, sorting a single array at a time usually cannot fully utilize the devices; thus grouping multiple independent arrays and sorting them simultaneously is crucial for high utilization.…”
Mentioning confidence: 99%
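The batching idea in this statement, sorting many small independent arrays at once rather than one at a time, can be sketched minimally in the same C++/OpenMP style. The CSR-like segment-pointer layout and the name sort_segments are assumptions for illustration, not the cited papers' interface.

    #include <algorithm>
    #include <vector>

    // Sketch: many short independent arrays stored back-to-back in `keys`,
    // delimited by `seg_ptr` (offsets, size num_segments + 1). Sorting them
    // as one batch keeps all cores busy even though each individual segment
    // is far too small to saturate the machine on its own.
    void sort_segments(std::vector<int>& keys, const std::vector<int>& seg_ptr) {
      const int num_segs = static_cast<int>(seg_ptr.size()) - 1;
      #pragma omp parallel for schedule(dynamic)
      for (int s = 0; s < num_segs; ++s)
        std::sort(keys.begin() + seg_ptr[s], keys.begin() + seg_ptr[s + 1]);
    }

Dynamic scheduling is used because segment lengths are typically skewed, the same workload-imbalance concern the surrounding statements raise for sparse data.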
“…Compared to stochastic gradient descent (SGD) [8,9], the ALS algorithm is not only inherently parallel but can also incorporate implicit ratings [1]. Nevertheless, the ALS algorithm involves parallel sparse matrix manipulation [10], for which achieving high performance is challenging due to imbalanced workloads [11,12,13], random memory accesses [14,15], unpredictable amounts of computation [16], and task dependencies [17,18,19]. This particularly holds when parallelizing and optimizing ALS on modern multi-cores and many-cores [20].…”
Section: Introduction
Mentioning confidence: 99%