Proceedings of the International Conference on Supercomputing 2017
DOI: 10.1145/3079079.3079105

Fast segmented sort on GPUs

Abstract: Segmented sort, as a generalization of classical sort, orders a batch of independent segments in a whole array. With the wider adoption of manycore processors for HPC and big data applications, segmented sort is playing an increasingly important role relative to classical sort. In this paper, we present an adaptive segmented sort mechanism on GPUs. Our mechanism includes two core techniques: (1) a differentiated method for different segment lengths to eliminate the irregularity caused by various workloads and thread divergence;…
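
To make the abstract's definition concrete, the following is a minimal sequential sketch of what segmented sort computes: each segment of one flat array is sorted independently of the others. It is a CPU reference for the semantics only, not the paper's GPU method. The function name `segmented_sort` and the CSR-style `offsets` boundary list are assumptions introduced for illustration.

```python
from typing import List, Sequence


def segmented_sort(values: Sequence[int], offsets: Sequence[int]) -> List[int]:
    """Sort each segment of `values` independently (sequential reference).

    `offsets` is a hypothetical CSR-style boundary list: segment i covers
    values[offsets[i]:offsets[i + 1]], so it holds (number of segments + 1)
    entries, starting at 0 and ending at len(values).
    """
    if not offsets or offsets[0] != 0 or offsets[-1] != len(values):
        raise ValueError("offsets must start at 0 and end at len(values)")

    out = list(values)
    # Walk consecutive boundary pairs; each pair delimits one segment.
    for start, end in zip(offsets, offsets[1:]):
        if start > end:
            raise ValueError("offsets must be non-decreasing")
        # Elements never move across a segment boundary.
        out[start:end] = sorted(out[start:end])
    return out


# Three segments of lengths 3, 1, and 4 stored in one flat array.
values = [5, 2, 9, 1, 8, 3, 7, 4]
offsets = [0, 3, 4, 8]
result = segmented_sort(values, offsets)
print(result)  # [2, 5, 9, 1, 3, 4, 7, 8]
```

Segments of very different lengths, as in this example, are the source of the workload irregularity that the abstract refers to.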

Citations: cited by 52 publications (21 citation statements)
References: 45 publications (53 reference statements)

“…Other research looked at specific cases of scan, in [58] the authors look at performing scan on tuples while minimizing global reads and facilitating latency hiding. Recently there has been some work in applying scan and reduction to optimize database queries [37,47,81].…”
Section: Related Work (mentioning)
confidence: 99%
“…The binary finite field multiplication algorithm was implemented by Eli Ben-Sasson et al yielded up to 138× speedup than the popular Number Theory Library [5]. Hou et al [14] implemented a register-based sort method shows great improvements over scratchpad memory methods on NVIDIA K80-Kepler and TitanX-Pascal GPUs. A 1-D stencil method is introduced as an example to illustrate how register cache and shuffle instruction works [5].…”
Section: Related Work (mentioning)
confidence: 99%
“…Compared to stochastic gradient descent (SGD) [8,9], the ALS algorithm is not only inherently parallel, but can incorporate implicit ratings [1]. Nevertheless, the ALS algorithm involves parallel sparse matrix manipulation [10] which is challenging to achieve high performance due to imbalanced workload [11,12,13], random memory access [14,15], unpredictable amount of computations [16] and task dependency [17,18,19]. This particularly holds when parallelizing and optimizing ALS on modern multi-cores and many-cores [20].…”
Section: Introduction (mentioning)
confidence: 99%