SIMD- and cache-friendly algorithm for sorting an array of structures

Inoue, Hiroshi; Taura, Kenjiro

doi:10.14778/2809974.2809988

Cited by 43 publications

(27 citation statements)

References 20 publications

(19 reference statements)

“…Merging two sorted arrays using traditional comparison instructions is sub-optimal: The aggressive out-of-order cores are not able to predict the direction of the merge branch (i.e., which of the two arrays will give the next element). Recent projects [26,43] show how to use SIMD instructions for efficient merging. Using 128-bit instructions, we can create a bitonic merge network that merges 8 elements at a time.…”

Section: Using Libmctop In Parallel Mergesort (mentioning)

confidence: 99%

Abstracting Multi-Core Topologies with MCTOP

Chatzopoulos

Guerraoui

Harris

et al. 2017

Proceedings of the Twelfth European Conference on Computer Systems

Portability and efficiency are usually antagonists in multicore computing. In order to develop efficient code, one needs to take into account the topology of the target multi-cores (e.g., for locality). This clearly hampers code portability. In this paper, we show that you can have the cake and eat it too. We introduce MCTOP, an abstraction of multi-core topologies augmented with important low-level hardware information, such as memory bandwidths and communication latencies. We show how to automatically generate MCTOP using libmctop, our library that leverages the determinism of cache-coherence protocols to infer the topology of multi-cores using only latency measurements. MCTOP enables developers to accurately and portably define high-level performance optimization policies. We illustrate several such policies through four examples: (i-ii) thread placement in OpenMP and in a MapReduce library, (iii) a topology-aware mergesort algorithm, as well as (iv) automatic backoff schemes for locks. We illustrate the portability of these optimizations on five processors from Intel, AMD, and Oracle, with low effort.
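
The citation statement quoted for this paper mentions a bitonic merge network that merges 8 elements at a time. As a rough illustration only, the sketch below shows the scalar logic of such a network for two sorted 4-element inputs. It is not the code of either paper: the function names `compare_exchange` and `bitonic_merge_4x4` are hypothetical, and Python's `min`/`max` merely stand in for the branch-free SIMD min/max instructions a real implementation would presumably use.

```python
def compare_exchange(v, i, j):
    """Put the smaller of v[i], v[j] at index i and the larger at index j.

    In a SIMD implementation this step would be a vector min and a vector
    max, so the control flow does not depend on the data (an assumption of
    this sketch; Python's min/max are only stand-ins).
    """
    lo, hi = min(v[i], v[j]), max(v[i], v[j])
    v[i], v[j] = lo, hi


def bitonic_merge_4x4(a, b):
    """Merge two sorted 4-element lists into one sorted 8-element list."""
    assert len(a) == 4 and len(b) == 4
    # An ascending run followed by a descending run is a bitonic sequence.
    v = list(a) + list(reversed(b))
    # Three stages of compare-exchanges at distances 4, 2, and 1.
    distance = 4
    while distance >= 1:
        for i in range(8):
            # Only the lower index of each pair starts a compare-exchange.
            if i & distance == 0:
                compare_exchange(v, i, i + distance)
        distance //= 2
    return v


if __name__ == "__main__":
    print(bitonic_merge_4x4([1, 4, 6, 7], [2, 3, 5, 8]))
```

The sequence of compare-exchanges is fixed and independent of the input values, which is what lets such a network avoid the hard-to-predict merge branch described in the quoted statement.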

“…trends [21,6,23,1,3,33,5,22,30,36]. The availability of low-cost memory, for instance, has given rise to the wide adoption of in-memory databases [35,26,24,8].…”

Section: Introduction (mentioning)

confidence: 99%

“…Moreover, sorting can speed up duplicate removal, ranking, and grouping operations [13]. Therefore, a lot of research has been devoted to identifying efficient sorting algorithms that utilise modern hardware features and scale well across multiple cores, processors, and even nodes [21,6,35,40,24,33,22,8]. After having recently achieved sorting rates of over one billion keys per second [28], Graphics Processing Units (GPUs), featuring thousands of cores and a memory bandwidth of several hundred gigabytes per second, emerged as a promising platform to accelerate sorting.…”

Section: Introduction (mentioning)

confidence: 99%

A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs

Stehle

Jacobsen

2017

Proceedings of the 2017 ACM International Conference on Management of Data

Sorting is at the core of many database operations, such as index creation, sort-merge joins, and user-requested output sorting. As GPUs are emerging as a promising platform to accelerate various operations, sorting on GPUs becomes a viable endeavour. Over the past few years, several improvements have been proposed for sorting on GPUs, leading to the first radix sort implementations that achieve a sorting rate of over one billion 32-bit keys per second. Yet, state-of-the-art approaches are heavily memory bandwidth-bound, as they require substantially more memory transfers than their CPU-based counterparts. Our work proposes a novel approach that almost halves the amount of memory transfers and, therefore, considerably lifts the memory bandwidth limitation. Being able to sort two gigabytes of eight-byte records in as little as 50 milliseconds, our approach achieves a 2.32-fold improvement over the state-of-the-art GPU-based radix sort for uniform distributions, sustaining a minimum speed-up of no less than a factor of 1.66 for skewed distributions. To address inputs that either do not reside on the GPU or exceed the available device memory, we build on our efficient GPU sorting approach with a pipelined heterogeneous sorting algorithm that mitigates the overhead associated with PCIe data transfers. Comparing the end-to-end sorting performance to the state-of-the-art CPU-based radix sort running 16 threads, our heterogeneous approach achieves a 2.06-fold and a 1.53-fold improvement for sorting 64 GB key-value pairs with a skewed and a uniform distribution, respectively.

“…Sorting is one of the most fundamental computation kernels in data management, and lots of approaches to accelerate the kernel have been proposed [1]-[8]. These approaches offer significant results, but mostly these studies utilize SIMD instructions of Intel processors [1], [7], [8] to exploit data-level parallelism or experiment on rich hardware environments such as supercomputers [5] or clusters [7].…”

Section: Introduction (mentioning)

confidence: 99%

“…These approaches offer significant results, but mostly these studies utilize SIMD instructions of Intel processors [1], [7], [8] to exploit data-level parallelism or experiment on rich hardware environments such as supercomputers [5] or clusters [7]. It is unclear that these approaches are available on low computational performance machines like embedded systems.…”

Section: Introduction (mentioning)

confidence: 99%

A High Performance FPGA-Based Sorting Accelerator with a Data Compression Mechanism

Kobayashi

Kise

2017

IEICE Trans. Inf. & Syst.

SUMMARY: Sorting is an extremely important computation kernel that has been accelerated in a lot of fields such as databases, image processing, and genome analysis. Given that advent of Internet of Things (IoT) era due to mobile technology progressions, the future needs a sorting method that is available on any environment, such as not only high performance systems like servers but also low computational performance machines like embedded systems. In this paper, we present an FPGA-based sorting accelerator combining Sorting Network and Merge Sorter Tree, which is customizable by means of tuning design parameters. The proposed FPGA accelerator sorts data sent from a host PC via the PCIe bus, and sends back the fully sorted data sequence to it. We also present a detailed analytical model that accurately estimates the sorting performance. Due to these characteristics, designers can know how fast a developed sorting hardware is in advance and can implement the best one to fulfill the cost and performance constraints. Our experiments show that the proposed hardware achieves up to 19.5x sorting performance, compared with Intel Core i7-3770K operating at 3.50GHz, when sorting 256M 32-bits integer elements. However, this result is limited because of insufficient memory bandwidth. To overcome this problem, we propose a data compression mechanism and the experimental result shows that the sorting hardware with it achieves almost 90% of the estimated performance, while the hardware without it does about 60%. In order to allow every designer to easily and freely use this accelerator, the RTL source code is released as open-source hardware.
