2011
DOI: 10.1007/978-3-642-23397-5_16
Engineering a Multi-core Radix Sort

Cited by 37 publications (39 citation statements)
References 8 publications
“…Researchers further optimized this buffering scheme to take advantage of write-combining and non-temporal stores [Wassenberg and Sanders 2011]. The idea is that each buffer should be at cache-line granularity to maximize the partition fanout, and that wide non-temporal writes (which can use SIMD registers) should be used to store the result to the output, to avoid polluting the cache with output data that will not be needed again any time soon.…”

Section: Data Shuffling Discussion
Citation type: mentioning (confidence: 99%)
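
As a concrete illustration of the buffering scheme described in the statement above, here is a minimal C++ sketch of radix partitioning with software write-combining: one cache-line buffer per bucket, flushed to the output with non-temporal SSE stores. This is not the cited authors' code; the function and parameter names, the 8-bit radix, 32-bit keys, and the assumption that bucket start offsets are padded to cache-line multiples (so the streaming stores stay aligned) are simplifications made here. A real implementation would also compute the histogram and prefix sums carried by bucket_start, and would fill the unaligned head of each bucket with ordinary stores instead of requiring padding.

#include <cstdint>
#include <cstring>
#include <emmintrin.h>   // SSE2: _mm_stream_si128, _mm_load_si128, _mm_sfence

constexpr int    RADIX_BITS = 8;
constexpr size_t NBUCKETS   = 1u << RADIX_BITS;
constexpr size_t LINE_U32   = 64 / sizeof(uint32_t);   // 16 keys per 64-byte cache line

// Partition n 32-bit keys from 'in' into 'out' by the byte selected by 'shift'.
// Simplifying assumptions for this sketch: 'out' is 64-byte aligned and each
// bucket_start[b] (exclusive prefix sum of the bucket sizes) is a multiple of 16 keys.
void partition_wc(const uint32_t* in, uint32_t* out, size_t n, int shift,
                  const size_t* bucket_start)
{
    alignas(64) uint32_t buf[NBUCKETS][LINE_U32];  // one cache-line buffer per bucket
    size_t fill[NBUCKETS];                         // keys currently buffered per bucket
    size_t dst[NBUCKETS];                          // next output index per bucket
    for (size_t b = 0; b < NBUCKETS; ++b) { fill[b] = 0; dst[b] = bucket_start[b]; }

    for (size_t i = 0; i < n; ++i) {
        uint32_t key = in[i];
        size_t b = (key >> shift) & (NBUCKETS - 1);
        buf[b][fill[b]++] = key;
        if (fill[b] == LINE_U32) {
            // Flush the full cache line with four 16-byte streaming stores,
            // bypassing the cache so the output does not evict useful data.
            __m128i*       d = reinterpret_cast<__m128i*>(out + dst[b]);
            const __m128i* s = reinterpret_cast<const __m128i*>(buf[b]);
            _mm_stream_si128(d + 0, _mm_load_si128(s + 0));
            _mm_stream_si128(d + 1, _mm_load_si128(s + 1));
            _mm_stream_si128(d + 2, _mm_load_si128(s + 2));
            _mm_stream_si128(d + 3, _mm_load_si128(s + 3));
            dst[b] += LINE_U32;
            fill[b] = 0;
        }
    }
    // Drain partially filled buffers with ordinary stores.
    for (size_t b = 0; b < NBUCKETS; ++b)
        std::memcpy(out + dst[b], buf[b], fill[b] * sizeof(uint32_t));
    _mm_sfence();  // make the streaming stores globally visible before the output is read
}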
“…This algorithm works very well in cache but suffers from the same problems as the non-in-place naive approach when the working set footprint exceeds the cache size. The solution proposed by Polychroniou and Ross [2014] adapts the buffering and write-combining techniques of Satish et al. [2010] and Wassenberg and Sanders [2011] to accelerate efficient in-place partitioning.…”

Section: Data Shuffling Discussion
Citation type: mentioning (confidence: 99%)
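
For reference, the simple in-place partition that "works very well in cache" can be sketched as a cycle-following permutation over precomputed bucket ranges, in the style of American flag sort. The sketch below uses hypothetical names and is not the code of Polychroniou and Ross [2014]; per the statement above, their contribution is to layer the cache-line buffering and write-combining of the previous sketch on top of such a scheme so that it also performs well once the working set exceeds the cache.

#include <cstdint>
#include <cstddef>
#include <utility>

// Minimal in-place MSD radix partition (8-bit radix): every iteration of the
// inner loop advances some bucket's fill pointer, so it runs in O(n) swaps.
void inplace_partition(uint32_t* a, size_t n, int shift)
{
    constexpr size_t NB = 256;
    size_t count[NB] = {0};
    for (size_t i = 0; i < n; ++i)
        ++count[(a[i] >> shift) & 0xFF];

    size_t next[NB], end[NB], sum = 0;   // [next[b], end[b]) = still-unfilled part of bucket b
    for (size_t b = 0; b < NB; ++b) {
        next[b] = sum;
        sum += count[b];
        end[b] = sum;
    }

    for (size_t b = 0; b < NB; ++b) {
        while (next[b] < end[b]) {
            size_t d = (a[next[b]] >> shift) & 0xFF;
            if (d == b)
                ++next[b];                               // element already belongs here
            else
                std::swap(a[next[b]], a[next[d]++]);     // send it to its own bucket
        }
    }
}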
“…We can even tune the compression rate by employing more partition passes to create wider prefixes. Each pass has been shown to be very efficient on memory-resident data, close to the RAM copy bandwidth [29,34]. If the inputs retain dictionary encoding through the join, the number of distinct values using the same prefix is maximized.…”

Section: Traffic Compression
Citation type: mentioning (confidence: 99%)
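
To make the "wider prefixes" point concrete, here is a purely illustrative helper (not from the cited paper): after p partition passes of r radix bits each, all keys in one partition share a p*r-bit prefix, which needs to be sent only once per partition, so each individual key needs only the remaining bits on the wire.

#include <cstdio>

// Illustrative arithmetic only: bits that must still travel per key after
// 'passes' radix-partition passes of 'radix_bits' each.
int bits_per_key_after_partitioning(int key_bits, int passes, int radix_bits)
{
    int shared_prefix = passes * radix_bits;
    return shared_prefix >= key_bits ? 0 : key_bits - shared_prefix;
}

int main()
{
    // e.g. 64-bit keys with a 12-bit radix: one pass leaves 52 bits/key, two passes 40.
    std::printf("%d\n", bits_per_key_after_partitioning(64, 1, 12));  // 52
    std::printf("%d\n", bits_per_key_after_partitioning(64, 2, 12));  // 40
    return 0;
}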
“…A 40 Gbps InfiniBand network measured less than 3 GB/s of real data rate per node during hash partitioning. If done in RAM, partitioning to a few thousand outputs runs close to the memory copy bandwidth [29,34]. For instance, a server using four 8-core CPUs and 1333 MHz quad-channel DDR3 DRAM achieves a partition rate of 30-35 GB/s, more than an order of magnitude higher than the InfiniBand network.…”

Section: Introduction
Citation type: mentioning (confidence: 99%)
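
A back-of-envelope check of the gap quoted above (added here for orientation, not taken from the cited paper): quad-channel DDR3-1333 has a theoretical peak of

\[
4 \times 1333 \times 10^{6}\ \tfrac{\mathrm{transfers}}{\mathrm{s}} \times 8\ \mathrm{B} \approx 42.7\ \mathrm{GB/s}\ \text{per socket} \;\; (\approx 170\ \mathrm{GB/s}\ \text{over four sockets}),
\]

so a sustained partition rate of 30-35 GB/s against roughly 3 GB/s per node over InfiniBand is a factor of

\[
\frac{30\text{--}35\ \mathrm{GB/s}}{3\ \mathrm{GB/s}} \approx 10\text{--}12,
\]

i.e. more than an order of magnitude, consistent with the statement.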