Summary
Matrix factorization is one of the leading techniques for many applications, such as recommendation in social networks. To date, many parallel stochastic gradient descent (SGD) methods have been proposed for matrix factorization on shared-memory (multi-core) and distributed systems. However, these methods do not translate well to graphics processing units (GPUs), where severe over-writing problems and thread divergence can occur. The fundamental reason is that a GPU is a single-instruction-multiple-data parallel device, which delivers large speedups only for applications with fine-grained parallelism. In this paper, we propose an efficient GPU algorithm, named GPUSGD, that solves the matrix factorization problem with SGD. The major advantage of GPUSGD is that it not only handles the over-writing problem but also avoids the performance loss caused by thread divergence. Experimental results show that GPUSGD accelerates matrix factorization far more effectively than existing state-of-the-art parallel methods. To the best of our knowledge, this is the first work to develop a parallel SGD method for matrix factorization on GPUs.
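To make the underlying computation concrete, the following is a minimal serial sketch of SGD-based matrix factorization (approximating a ratings matrix R by P·Qᵀ). It is not the paper's parallel GPUSGD algorithm; all names and hyperparameters here are illustrative assumptions.

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=8, lr=0.02, reg=0.02, epochs=500, seed=0):
    """Serial SGD matrix factorization: find P (users x k) and Q (items x k)
    so that P[u] @ Q[i] approximates each observed rating (u, i, r)."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            # Move both factor vectors toward the observed rating,
            # with L2 regularization to keep them small.
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q

# Tiny demo on five observed entries of a 3x3 matrix.
triples = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (2, 1, 1), (2, 2, 4)]
P, Q = sgd_mf(triples, n_users=3, n_items=3)
```

The over-writing problem the abstract mentions arises exactly here: two parallel threads updating ratings that share a user row `P[u]` or item row `Q[i]` would race on the same memory.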
Summary
In this paper, we study how to improve the performance of the Markov chain Monte Carlo method for solving the local PageRank problem in a general-purpose graphics processing unit (GPGPU) environment. Because a large number of dangling vertices inflates the storage space and slows down the Markov chain processing, we propose a reordering strategy that compresses the storage and reduces the computational complexity of the Markov chain processing. In our performance study, after parallelizing and optimizing the proposed algorithm on the GPU, the reordering strategy is up to 12× faster than the basic method on graphs with a high proportion of dangling vertices. Our investigation shows that the variance of the random walks determines how many walks the computation requires; we therefore introduce low-discrepancy sequences to enhance performance. Moreover, the low-discrepancy sequences are loaded into on-chip shared memory to accelerate fetching, with a warp-scheduling scheme designed around bank conflicts. A series of experiments has been conducted to evaluate the efficiency of these optimizations. Compared with fetching data from off-chip global memory, the shared-memory-based strategy achieves over 10× speedup. The experiments also indicate that the size of the shared memory has a significant impact on the parallelism of the proposed method.
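For readers unfamiliar with the Monte Carlo approach to PageRank, here is a minimal serial sketch of the basic random-walk estimator (no reordering, no low-discrepancy sequences, no GPU parallelism). All names are illustrative; this is an assumption about the standard baseline, not the paper's optimized algorithm.

```python
import random

def mc_pagerank(graph, n_walks=200, damping=0.85, seed=0):
    """Estimate PageRank by Monte Carlo: launch n_walks random walks from
    each vertex; at every step a walk continues with probability `damping`
    and stops otherwise. A walk that reaches a dangling vertex (one with
    no out-edges) also stops. The estimate for vertex v is
    (visits to v) / (total visits across all walks)."""
    rng = random.Random(seed)
    visits = {v: 0 for v in graph}
    for start in graph:
        for _ in range(n_walks):
            v = start
            while True:
                visits[v] += 1
                if not graph[v] or rng.random() > damping:
                    break
                v = rng.choice(graph[v])
    total = sum(visits.values())
    return {v: c / total for v, c in visits.items()}

# Adjacency lists: vertex 3 feeds into a 3-cycle and has no in-links.
g = {0: [1], 1: [2], 2: [0], 3: [0]}
pr = mc_pagerank(g)
```

Dangling vertices terminate walks early, which is why graphs with a high proportion of them both waste storage and distort the walk lengths the abstract refers to.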