2016
DOI: 10.1002/cpe.4064

Efficient and high‐quality sparse graph coloring on GPUs

Abstract: Graph coloring has been broadly used to discover concurrency in parallel computing. To speed up graph coloring for large‐scale datasets, parallel algorithms have been proposed to leverage modern GPUs. Existing GPU implementations either have limited performance or yield unsatisfactory coloring quality (too many colors assigned). We present a work‐efficient parallel graph coloring implementation on GPUs with good coloring quality. Our approach uses the speculative greedy scheme, which inherently yields …
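The speculative greedy scheme the abstract names is a well-known iterate-and-repair pattern: all vertices simultaneously take the smallest color unused by their neighbors, adjacent vertices that picked the same color are flagged as conflicts, and only the conflicting vertices are re-colored in the next round. Below is a minimal sequential Python sketch of that pattern; the adjacency-dict format and function name are illustrative assumptions, not the paper's GPU implementation.

```python
def speculative_greedy_coloring(adj):
    """Iterate-and-repair coloring sketch (illustrative, not the
    paper's GPU code).

    adj: dict mapping each vertex to an iterable of its neighbors
         (a simple graph with no self-loops is assumed).
    Returns a dict mapping each vertex to a 0-based color.
    """
    color = {v: -1 for v in adj}   # -1 means "not yet colored"
    worklist = set(adj)            # vertices that still need (re)coloring

    while worklist:
        # Speculation: every worklist vertex picks the smallest color
        # not used by its neighbors, all reading the same pre-round
        # snapshot; this mimics the simultaneous parallel assignment
        # that makes the scheme speculative.
        snapshot = dict(color)
        for v in worklist:
            used = {snapshot[u] for u in adj[v] if snapshot[u] >= 0}
            c = 0
            while c in used:
                c += 1
            color[v] = c

        # Conflict detection: adjacent vertices that chose the same
        # color must be repaired; letting the smaller vertex id keep
        # its color guarantees the worklist shrinks every round.
        worklist = {max(u, v)
                    for v in worklist
                    for u in adj[v]
                    if color[u] == color[v]}

    return color


# Tiny usage example: prints a proper coloring of a 4-cycle.
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(speculative_greedy_coloring(square))
```

Note that the repair rounds can cost extra colors relative to an optimal coloring (the 4-cycle above may end with three colors instead of two), which is exactly the coloring-quality concern the abstract raises.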

Cited by 14 publications (9 citation statements). References 61 publications.
“…Graph analytics have been widely applied in many applications [39,40]. In this paper, we have presented HPGraph, a GPU graph analytics framework which maps vertex programming to an optimized matrix backend.…”
Section: Discussion
confidence: 99%
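The HPGraph quote describes running a vertex-programming abstraction on a matrix backend. The standard correspondence behind such designs (used by GraphBLAS-style systems) is that one frontier-advance step of a vertex program equals a vector-matrix product over a Boolean semiring. A toy NumPy sketch of that equivalence follows; the 4-vertex graph and function names are hypothetical, and nothing here reflects HPGraph's actual API.

```python
import numpy as np

# Hypothetical directed 4-vertex graph as a 0/1 adjacency matrix
# (a real backend would use a sparse format such as CSR).
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])

def advance_vertex(frontier):
    """Vertex-programming view: each frontier vertex pushes to its
    out-neighbors."""
    out = np.zeros(len(A), dtype=bool)
    for v in np.flatnonzero(frontier):
        out |= A[v] > 0
    return out

def advance_matrix(frontier):
    """Matrix-backend view: the same step as one vector-matrix product
    over a Boolean semiring (nonzero after 0/1 arithmetic acts as OR)."""
    return (frontier.astype(int) @ A) > 0

f = np.array([True, False, False, False])
print(advance_vertex(f))   # [False  True False False]
print(advance_matrix(f))   # identical result
```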
“…The experimental results show that Feluca achieves up to 8.39×, 14.70×, 7.55×, and 9.70× speedup over Kokkos [20], Gunrock [36], SIRG [44], and ChenGC [42], [43], respectively. Table 4 shows that Feluca outperforms all other competitors in terms of run-time on all ten datasets.…”
Section: Comparison Against the State-of-the-Art Techniques
confidence: 99%
“…We compared Feluca with some state-of-the-art methods in this area, such as Kokkos [20], Gunrock [36], GraphBLAST [41], ChenGC [42], [43], SIRG [44], cuSPARSE [40], and JPL [40]. In this experiment, Feluca switches the execution stage by setting α to 10%.…”
Section: Comparison Against the State-of-the-Art Techniques
confidence: 99%
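The quote above notes that Feluca switches execution stages once a threshold α (10% in that experiment) is reached. As a rough illustration of that kind of two-stage design, the sketch below runs speculative coloring rounds while many vertices remain in conflict and finishes the small tail sequentially once the conflicting fraction drops below α. Only the α-controlled switch comes from the quote; the stage contents are an assumption for illustration, not Feluca's actual algorithm.

```python
def two_stage_coloring(adj, alpha=0.10):
    """Illustrative two-stage coloring with an alpha-controlled switch.

    Stage 1 uses speculative rounds (efficient while the worklist is
    large); stage 2 finishes the remaining conflicting tail one vertex
    at a time, which cannot introduce new conflicts. (A sketch of the
    general two-stage idea, not Feluca's GPU kernels.)
    """
    color = {v: -1 for v in adj}
    worklist = set(adj)
    n = len(adj)

    def smallest_free(v, view):
        used = {view[u] for u in adj[v] if view[u] >= 0}
        c = 0
        while c in used:
            c += 1
        return c

    # Stage 1: speculative rounds until the worklist falls below alpha*n.
    while len(worklist) > alpha * n:
        snapshot = dict(color)
        for v in worklist:
            color[v] = smallest_free(v, snapshot)
        worklist = {max(u, v) for v in worklist for u in adj[v]
                    if color[u] == color[v]}

    # Stage 2: sequential cleanup of the short tail.
    for v in worklist:
        color[v] = smallest_free(v, color)
    return color
```

The usual motivation for such a switch is that speculative rounds converge slowly on the last few conflicting vertices, so handing the tail to a cheap sequential pass avoids many near-empty parallel rounds.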
“…Compared to stochastic gradient descent (SGD) [8,9], the ALS algorithm is not only inherently parallel but can also incorporate implicit ratings [1]. Nevertheless, the ALS algorithm involves parallel sparse matrix manipulation [10], for which achieving high performance is challenging due to imbalanced workloads [11,12,13], random memory accesses [14,15], unpredictable amounts of computation [16], and task dependencies [17,18,19]. This particularly holds when parallelizing and optimizing ALS on modern multi-cores and many-cores [20].…”
Section: Introduction
confidence: 99%
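The last quote calls ALS inherently parallel: with one factor matrix fixed, every row of the other factor is an independent regularized least-squares solve. A minimal NumPy sketch of one ALS sweep follows; the dense rating matrix, variable names, and regularization constant are illustrative assumptions (real systems solve over the sparse observed entries only, which is where the load-imbalance and irregular-access issues the quote lists come from).

```python
import numpy as np

def als_sweep(R, U, V, lam=0.1):
    """One ALS sweep for the factorization R ~= U @ V.T.

    Each row update below is an independent ridge-regression solve,
    which is what makes ALS embarrassingly parallel across users and
    items. Dense illustrative version; not a production implementation.
    """
    k = U.shape[1]
    I = np.eye(k)
    # Fix V, solve for every user's factor vector independently.
    for i in range(U.shape[0]):
        U[i] = np.linalg.solve(V.T @ V + lam * I, V.T @ R[i])
    # Fix U, solve for every item's factor vector independently.
    for j in range(V.shape[0]):
        V[j] = np.linalg.solve(U.T @ U + lam * I, U.T @ R[:, j])
    return U, V

# Usage: a few sweeps drive down the reconstruction error.
rng = np.random.default_rng(0)
R = rng.random((6, 5))
U, V = rng.random((6, 3)), rng.random((5, 3))
for _ in range(20):
    U, V = als_sweep(R, U, V)
print(np.linalg.norm(R - U @ V.T))
```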