2011
DOI: 10.1007/978-3-642-23397-5_16
Engineering a Multi-core Radix Sort

Cited by 37 publications (39 citation statements)
References 8 publications
“…Researchers further optimized this buffering scheme to take advantage of write-combining and non-temporal stores [Wassenberg and Sanders 2011]. The idea is that each buffer should be at cache-line granularity to maximize the partition fanout, and that wide non-temporal writes (which can use SIMD registers) should be used to store the result to the output, to avoid polluting the cache with output data that will not be needed again any time soon.…”

Section: Data Shuffling Discussion
Citation type: mentioning (confidence: 99%)
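
As a concrete illustration of the buffering scheme described in the statement above, here is a minimal C++ sketch of radix partitioning with software write-combining: one cache-line buffer per bucket, flushed to the output with non-temporal SSE stores. This is not the cited authors' code; the function and parameter names, the 8-bit radix, 32-bit keys, and the assumption that bucket start offsets are padded to cache-line multiples (so the streaming stores stay aligned) are simplifications made here. A real implementation would also compute the histogram and prefix sums carried by bucket_start, and would fill the unaligned head of each bucket with ordinary stores instead of requiring padding.

#include <cstdint>
#include <cstring>
#include <emmintrin.h>   // SSE2: _mm_stream_si128, _mm_load_si128, _mm_sfence

constexpr int    RADIX_BITS = 8;
constexpr size_t NBUCKETS   = 1u << RADIX_BITS;
constexpr size_t LINE_U32   = 64 / sizeof(uint32_t);   // 16 keys per 64-byte cache line

// Partition n 32-bit keys from 'in' into 'out' by the byte selected by 'shift'.
// Simplifying assumptions for this sketch: 'out' is 64-byte aligned and each
// bucket_start[b] (exclusive prefix sum of the bucket sizes) is a multiple of 16 keys.
void partition_wc(const uint32_t* in, uint32_t* out, size_t n, int shift,
                  const size_t* bucket_start)
{
    alignas(64) uint32_t buf[NBUCKETS][LINE_U32];  // one cache-line buffer per bucket
    size_t fill[NBUCKETS];                         // keys currently buffered per bucket
    size_t dst[NBUCKETS];                          // next output index per bucket
    for (size_t b = 0; b < NBUCKETS; ++b) { fill[b] = 0; dst[b] = bucket_start[b]; }

    for (size_t i = 0; i < n; ++i) {
        uint32_t key = in[i];
        size_t b = (key >> shift) & (NBUCKETS - 1);
        buf[b][fill[b]++] = key;
        if (fill[b] == LINE_U32) {
            // Flush the full cache line with four 16-byte streaming stores,
            // bypassing the cache so the output does not evict useful data.
            __m128i*       d = reinterpret_cast<__m128i*>(out + dst[b]);
            const __m128i* s = reinterpret_cast<const __m128i*>(buf[b]);
            _mm_stream_si128(d + 0, _mm_load_si128(s + 0));
            _mm_stream_si128(d + 1, _mm_load_si128(s + 1));
            _mm_stream_si128(d + 2, _mm_load_si128(s + 2));
            _mm_stream_si128(d + 3, _mm_load_si128(s + 3));
            dst[b] += LINE_U32;
            fill[b] = 0;
        }
    }
    // Drain partially filled buffers with ordinary stores.
    for (size_t b = 0; b < NBUCKETS; ++b)
        std::memcpy(out + dst[b], buf[b], fill[b] * sizeof(uint32_t));
    _mm_sfence();  // make the streaming stores globally visible before the output is read
}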
“…This algorithm works very well in cache but suffers from the same problems as the non-in-place naive approach when the working set footprint exceeds the cache size. The solution proposed by Polychroniou and Ross [2014] adapts the buffering and write-combining techniques of Satish et al. [2010] and Wassenberg and Sanders [2011] to accelerate efficient in-place partitioning.…”

Section: Data Shuffling Discussion
Citation type: mentioning (confidence: 99%)
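
For reference, the simple in-place partition that "works very well in cache" can be sketched as a cycle-following permutation over precomputed bucket ranges, in the style of American flag sort. The sketch below uses hypothetical names and is not the code of Polychroniou and Ross [2014]; per the statement above, their contribution is to layer the cache-line buffering and write-combining of the previous sketch on top of such a scheme so that it also performs well once the working set exceeds the cache.

#include <cstdint>
#include <cstddef>
#include <utility>

// Minimal in-place MSD radix partition (8-bit radix): every iteration of the
// inner loop advances some bucket's fill pointer, so it runs in O(n) swaps.
void inplace_partition(uint32_t* a, size_t n, int shift)
{
    constexpr size_t NB = 256;
    size_t count[NB] = {0};
    for (size_t i = 0; i < n; ++i)
        ++count[(a[i] >> shift) & 0xFF];

    size_t next[NB], end[NB], sum = 0;   // [next[b], end[b]) = still-unfilled part of bucket b
    for (size_t b = 0; b < NB; ++b) {
        next[b] = sum;
        sum += count[b];
        end[b] = sum;
    }

    for (size_t b = 0; b < NB; ++b) {
        while (next[b] < end[b]) {
            size_t d = (a[next[b]] >> shift) & 0xFF;
            if (d == b)
                ++next[b];                               // element already belongs here
            else
                std::swap(a[next[b]], a[next[d]++]);     // send it to its own bucket
        }
    }
}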
“…We can even tune the compression rate by employing more partition passes to create wider prefixes. Each pass has been shown to be very efficient on memory-resident data, close to the RAM copy bandwidth [29,34]. If the inputs retain dictionary encoding through the join, the number of distinct values using the same prefix is maximized.…”

Section: Traffic Compression
Citation type: mentioning (confidence: 99%)
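
To make the "wider prefixes" point concrete, here is a purely illustrative helper (not from the cited paper): after p partition passes of r radix bits each, all keys in one partition share a p*r-bit prefix, which needs to be sent only once per partition, so each individual key needs only the remaining bits on the wire.

#include <cstdio>

// Illustrative arithmetic only: bits that must still travel per key after
// 'passes' radix-partition passes of 'radix_bits' each.
int bits_per_key_after_partitioning(int key_bits, int passes, int radix_bits)
{
    int shared_prefix = passes * radix_bits;
    return shared_prefix >= key_bits ? 0 : key_bits - shared_prefix;
}

int main()
{
    // e.g. 64-bit keys with a 12-bit radix: one pass leaves 52 bits/key, two passes 40.
    std::printf("%d\n", bits_per_key_after_partitioning(64, 1, 12));  // 52
    std::printf("%d\n", bits_per_key_after_partitioning(64, 2, 12));  // 40
    return 0;
}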
“…A 40 Gbps InfiniBand network measured less than 3 GB/s of real data rate per node during hash partitioning. If done in RAM, partitioning to a few thousand outputs runs close to the memory copy bandwidth [29,34]. For instance, a server using four 8-core CPUs and 1333 MHz quad-channel DDR3 DRAM achieves a partition rate of 30-35 GB/s, more than an order of magnitude higher than the InfiniBand network.…”

Section: Introduction
Citation type: mentioning (confidence: 99%)
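
A back-of-envelope check of the gap quoted above (added here for orientation, not taken from the cited paper): quad-channel DDR3-1333 has a theoretical peak of

\[
4 \times 1333 \times 10^{6}\ \tfrac{\mathrm{transfers}}{\mathrm{s}} \times 8\ \mathrm{B} \approx 42.7\ \mathrm{GB/s}\ \text{per socket} \;\; (\approx 170\ \mathrm{GB/s}\ \text{over four sockets}),
\]

so a sustained partition rate of 30-35 GB/s against roughly 3 GB/s per node over InfiniBand is a factor of

\[
\frac{30\text{--}35\ \mathrm{GB/s}}{3\ \mathrm{GB/s}} \approx 10\text{--}12,
\]

i.e. more than an order of magnitude, consistent with the statement.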