Triton Join: Efficiently Scaling to a Large Join State on GPUs with Fast Interconnects

Lutz, Clemens; Breß, Sebastian; Zeuch, Steffen; Rabl, Tilmann; Markl, Volker

doi:10.1145/3514221.3517911

Cited by 9 publications

(3 citation statements)

References 46 publications

“…Several works have shown the benefits of adapting applications to the underlying platform, e.g., by using SIMD [41,44,45,57]. IBM Power systems are used in previous work, as they integrate well with various accelerators [49][50][51]. Bari et al [12] find that A64FX single-thread performance is low, in line with our findings.…”

Section: Related Work (supporting)

confidence: 89%

Analyzing Vectorized Hash Tables across CPU Architectures

Böther, Benson, Klimovic et al. 2023

Proc. VLDB Endow.

Self Cite

Data processing systems often leverage vector instructions to achieve higher performance. When applying vector instructions, an often overlooked data structure is the hash table, even though it is fundamental in data processing systems for operations such as indexing, aggregating, and joining. In this paper, we characterize and evaluate three fundamental vectorized hashing schemes, vectorized linear probing (VLP), vectorized fingerprinting (VFP), and bucket-based comparison (BBC). We implement these hashing schemes on the x86, ARM, and Power CPU architectures, as modern database systems must provide efficient implementations for multiple platforms due to the continuously increasing hardware heterogeneity. We present various implementation variants and platform-specific optimizations, which we evaluate for integer keys, string keys, large payloads, skewed distributions, and multiple threads. Our extensive evaluation and comparison to three scalar hashing schemes on four servers shows that BBC outperforms scalar linear probing by a factor of more than 2x, while also scaling well to high load factors. We find that vectorized hashing schemes come with caveats that need to be considered, such as the increased engineering overhead, differences between CPUs, and differences between vector ISAs, such as AVX and AVX-512, which impact performance. We conclude with key findings for vectorized hashing scheme implementations.

“…This approach facilitated the rapid integration of FPGA kernels into existing software and communication patterns. However, it is important to note that the OpenCL programming model was initially designed to leverage the acceleration characteristics of GPUs, which by essence involve processing large volumes of data [37,70]. In contrast, FPGAs do not necessarily follow the same principle, and can operate efficiently as fine-grained data-flow units, handling smaller data sets at a time.…”

Section: Discussion (mentioning)

confidence: 99%

Strega: An HTTP Server for FPGAs

Maschi, Alonso 2024

ACM Trans. Reconfigurable Technol. Syst.

The computer architecture landscape is being reshaped by the new opportunities, challenges and constraints brought by the cloud. On the one hand, high-level applications profit from specialised hardware to boost their performance and reduce deployment costs. On the other hand, cloud providers maximise the CPU time allocated to client applications by offloading infrastructure tasks to hardware accelerators. While it is well understood how to do this for, e.g., network function virtualisation and protocols such as TCP/IP, support for higher networking layers is still largely missing, limiting the potential of accelerators. In this paper, we present Strega, an open-source light-weight HTTP server that enables crucial functionality such as FPGA-accelerated functions being called through a RESTful protocol (FPGA-as-a-Function). Our experimental analysis shows that a single Strega node sustains a throughput of 1.7 M HTTP requests per second with an end-to-end latency as low as 16 μs, outperforming nginx running on 32 vCPUs in both metrics, and can even be an alternative to the traditional OpenCL flow over the PCIe bus. Through this work, we pave the way for running microservices directly on FPGAs, bypassing CPU overhead and realising the full potential of FPGA acceleration in distributed cloud applications.

“…In heterogeneous compute architectures, the overhead of transferring data (e.g., between host and graphics processing unit (GPU) memory) can still have a major impact on the overall performance, even when the latest state-of-the-art interconnection technologies are used such as NVLink-2 on the intra-node level [1,2] and InfiniBand EDR on the inter-node level [1]. For many data-intensive applications, scaling out to multiple nodes is the most feasible strategy to satisfy their resource demands.…”

Section: Introduction (mentioning)

confidence: 99%

Improved data transfer efficiency for scale‐out heterogeneous workloads using on‐the‐fly I/O link compression

Plauth, Micó, Polze 2020

Concurrency and Computation

Graphics processing units (GPUs) are unarguably vital to keep up with the perpetually growing demand for compute capacity of data-intensive applications. However, the overhead of transferring data between host and GPU memory is already a major limiting factor on the single-node level. The situation intensifies in scale-out scenarios, where data movement is becoming even more expensive. By augmenting the CloudCL framework with 842-based compression facilities, this article demonstrates that transparent on-the-fly I/O link compression can yield performance improvements between 1.11× and 2.07× across tested scale-out GPU workloads.
