2015
DOI: 10.1007/978-3-319-13960-9_1
Massively Parallel NUMA-Aware Hash Joins

Abstract: Driven by two main hardware trends of the past few years, increasing main memory capacity and massively parallel multi-core processing, there has been much research effort in parallelizing well-known join algorithms. However, the non-uniform memory access (NUMA) of these architectures to main memory has gained only limited attention in the design of these algorithms. We study recent proposals of main memory hash join implementations and identify their major performance problems on NUMA architectures. We then…

Cited by 28 publications (33 citation statements) | References 8 publications
“…For instance, Kim et al [1] look at the effects of caches and TLBs (translation lookaside buffers) on main-memory parallel hash joins and show how careful partitioning according to the cache and TLB sizes leads to improved performance. Along the same lines, Lang et al [2] have shown how tuning to the non-uniform memory access (NUMA) characteristics also leads to improved performance of parallel hash joins. We will refer to the algorithms that take hardware characteristics into consideration as hardware-conscious.…”
Section: Introduction (mentioning, confidence: 92%)
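The snippet above refers to partitioning a join's input according to cache and TLB sizes. A minimal sketch of that idea, assuming illustrative names and sizes (not code from Kim et al. [1]): a radix partitioning pass whose fan-out is capped at the number of TLB entries, repeated until the desired number of hash bits has been consumed, so that no single pass touches more output partitions than the TLB can map.

```python
def radix_partition(keys, num_bits, shift):
    """Split integer keys into 2**num_bits partitions by a slice of their bits."""
    fanout = 1 << num_bits
    mask = fanout - 1
    partitions = [[] for _ in range(fanout)]
    for k in keys:
        partitions[(k >> shift) & mask].append(k)
    return partitions

def multi_pass_partition(keys, total_bits, tlb_entries):
    """Consume total_bits of the key in passes whose fan-out stays <= tlb_entries.

    tlb_entries caps the number of distinct output pages written per pass,
    which is what makes the partitioning TLB-conscious.
    """
    bits_per_pass = max(1, tlb_entries.bit_length() - 1)  # 2**bits <= tlb_entries
    parts = [keys]
    done = 0
    while done < total_bits:
        step = min(bits_per_pass, total_bits - done)
        parts = [p for part in parts for p in radix_partition(part, step, done)]
        done += step
    return parts
```

The per-pass fan-out cap is the hardware-conscious part: a single-pass partitioning with a very large fan-out would thrash the TLB, while several small-fan-out passes keep every pass's working set of output pages mappable at once.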
“…E.g., Albutiu et al [6] show that prefetching can hide the latency of remote accesses, constructing a competitive sort-merge join. Hash-joins, however, are shown to be superior [8,20]. Yinan et al [25] optimize data shuffling on a fully-interconnected NUMA topology.…”
Section: Related Work (mentioning, confidence: 99%)
“…[4] proposes a partitioned join that minimizes random inter-socket reads, and [15] improves upon that with a NUMA-aware data shuffling stage. [13] presents a latch-free hash table design for a scalable NUMA-aware build phase. We think that simplifying hash joins to a series of DIRA lookups will make hardware acceleration easier, because we can repeatedly use a gather primitive.…”
Section: Related Work (mentioning, confidence: 99%)
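The last snippet frames a hash join probe as a series of lookups that a hardware gather primitive could serve. A hedged sketch of that framing, with hypothetical names and not taken from any of the cited papers: the build side is stored in one flat open-addressing table, so every probe reduces to a chain of indexed loads into a single array.

```python
EMPTY = None  # sentinel for an unused slot

def build_table(build_keys, size):
    """Build phase: insert keys into a flat linear-probing hash table."""
    table = [EMPTY] * size
    for k in build_keys:
        slot = hash(k) % size
        while table[slot] is not EMPTY:      # resolve collisions by linear probing
            slot = (slot + 1) % size
        table[slot] = k
    return table

def probe(table, probe_keys):
    """Probe phase: each lookup is an indexed load table[slot] from one array,
    the access pattern a gather instruction could service in bulk."""
    size = len(table)
    matches = []
    for k in probe_keys:
        slot = hash(k) % size
        while table[slot] is not EMPTY:
            if table[slot] == k:
                matches.append(k)
                break
            slot = (slot + 1) % size
    return matches
```

Because every memory access in `probe` is `table[slot]` on one contiguous array, a batch of probe keys maps naturally onto a vectorized gather, which is the simplification the citation argues makes hardware acceleration easier.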