Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
DOI: 10.1145/3318464.3389705

Pump Up the Volume

Cited by 50 publications (5 citation statements)
References 41 publications
“…Different communication interfaces might play a key role in alleviating this effect, such as streaming [14], where queries can be fed to the accelerator, executed, and returned while the rest of the stream is still being processed. Future developments of interconnects, such as CXL and NVLink, and of accelerator interfaces have the potential of lifting the current limitations and enabling new use cases [15,17]. Notably, GPU starvation by lack of incoming data is one of the motivations for NVIDIA's CPU architecture, which provides a much higher bandwidth between CPU and GPU than conventional approaches [7,8].…”
Section: Optimising the Input Channel
confidence: 99%
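
The streaming interface described in this statement boils down to copy/compute overlap. Below is a minimal sketch in CUDA of a double-buffered input pipeline, assuming two streams and pinned host memory; the kernel, batch size, and all names are illustrative and not taken from the cited works.

#include <cuda_runtime.h>

// Illustrative stand-in for the per-batch query work.
__global__ void process_batch(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int kBatch = 1 << 20;            // elements per batch (illustrative)
    const int kBatches = 8;
    const size_t bytes = kBatch * sizeof(float);

    // Pinned host memory is required for truly asynchronous copies.
    float *h_in, *h_out;
    cudaMallocHost((void**)&h_in, kBatches * bytes);
    cudaMallocHost((void**)&h_out, kBatches * bytes);

    float *d_in[2], *d_out[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc((void**)&d_in[b], bytes);
        cudaMalloc((void**)&d_out[b], bytes);
        cudaStreamCreate(&s[b]);
    }

    // Double buffering: while one stream computes batch i, the other
    // stream transfers batch i+1, so the GPU is not starved waiting
    // on the interconnect.
    for (int i = 0; i < kBatches; ++i) {
        int b = i & 1;
        cudaMemcpyAsync(d_in[b], h_in + (size_t)i * kBatch, bytes,
                        cudaMemcpyHostToDevice, s[b]);
        process_batch<<<(kBatch + 255) / 256, 256, 0, s[b]>>>(d_in[b], d_out[b], kBatch);
        cudaMemcpyAsync(h_out + (size_t)i * kBatch, d_out[b], bytes,
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();

    for (int b = 0; b < 2; ++b) {
        cudaFree(d_in[b]); cudaFree(d_out[b]); cudaStreamDestroy(s[b]);
    }
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}

The same pattern extends naturally to queries arriving as a stream: results of batch i are copied back while batch i+1 is already in flight.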
“…They also show the increasing amount of compute power, network, and memory bandwidth needed on the CPU side to be able to match the throughput of the accelerator. A number of other studies have also explored this problem [11,17,32], confirming that the advantages a hardware accelerator can bring are bound by the ability to generate enough load on it, often leading to a situation where the accelerated system is larger than the initial one. This issue is one of the reasons why new processor architectures are emerging that try to avoid these bottlenecks [7,8] and new standards for peripheral interconnects are appearing [6,24,30].…”
Section: Introduction
confidence: 99%
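
As an illustrative back-of-envelope (all numbers assumed for illustration, not taken from the cited works): a GPU that scans and filters data at around 100 GB/s but is fed over a PCIe 3.0 x16 link with roughly 12 GB/s of effective bandwidth sits idle close to 90% of the time on transfer-bound queries. Saturating it would take on the order of eight such links, a faster interconnect, or several host machines generating load, which is precisely the "accelerated system larger than the initial one" effect described above.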
“…Code generation for an operator (a chain of operators terminated by a pipeline breaker, i.e. an operator that requires the complete intermediate result to be ready before the next one can execute), combined with batched data processing, substantially reduces data transfers over the slow PCIe bus and thereby yields a multi-fold speedup in query execution [2]. Increasing the bus bandwidth brings a significant improvement in overall query execution performance [3], even when the input data does not fit entirely in memory (morsel-driven processing).…”
Section: Hybrid Execution of Queries to Analytical Databases
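
A minimal CUDA sketch of the kernel-fusion idea behind this statement (the query, predicate, and all names are invented for illustration): instead of materializing each operator's intermediate result and shipping it across PCIe, the generated kernel evaluates the whole operator chain up to the pipeline breaker, here a SUM aggregate, so intermediates stay in registers and only the final aggregate crosses the bus.

#include <cuda_runtime.h>
#include <cstdio>

// Fused pipeline: scan -> filter -> map -> aggregate in a single kernel.
// The pipeline breaker (SUM) is the only point where a full result must
// exist, so it is the only value transferred back to the host.
__global__ void fused_filter_scale_sum(const int* in, unsigned long long* sum, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int v = in[i];                                    // scan
        if (v > 10)                                       // filter (illustrative predicate)
            atomicAdd(sum, (unsigned long long)(v * 3));  // map + aggregate
    }
}

int main() {
    const int n = 1 << 20;
    int* h = new int[n];
    for (int i = 0; i < n; ++i) h[i] = i % 100;

    int* d_in; unsigned long long* d_sum;
    cudaMalloc((void**)&d_in, n * sizeof(int));
    cudaMalloc((void**)&d_sum, sizeof(unsigned long long));
    cudaMemcpy(d_in, h, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_sum, 0, sizeof(unsigned long long));

    fused_filter_scale_sum<<<(n + 255) / 256, 256>>>(d_in, d_sum, n);

    unsigned long long result = 0;
    cudaMemcpy(&result, d_sum, sizeof(result), cudaMemcpyDeviceToHost);
    printf("SUM(v*3) WHERE v > 10: %llu\n", result);

    cudaFree(d_in); cudaFree(d_sum); delete[] h;
    return 0;
}

An unfused plan would launch one kernel per operator and materialize two intermediate arrays; with batched execution on top, the fused version moves only one input batch in and eight bytes out per batch.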
“…In the age of ever-increasing data volumes, the overhead of data transfers is a major inhibitor of further performance improvements on many levels. In heterogeneous compute architectures, the overhead of transferring data (e.g., between host and graphics processing unit (GPU) memory) can still have a major impact on the overall performance, even when the latest state-of-the-art interconnection technologies are used, such as NVLink-2 on the intra-node level [1,2] and InfiniBand EDR on the inter-node level [1]. For many data-intensive applications, scaling out to multiple nodes is the most feasible strategy to satisfy their resource demands.…”
Section: Introduction
confidence: 99%
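
To make the transfer-overhead point concrete, here is a small CUDA sketch that measures effective host-to-device copy bandwidth from pageable versus pinned host memory (buffer size and helper names are illustrative). On PCIe-attached GPUs this copy bandwidth, rather than kernel throughput, is often the bottleneck such statements refer to.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Times a single host-to-device copy of `bytes` and returns GB/s.
// Note: the first CUDA call also pays one-time context setup, so a
// warm-up copy before measuring would tighten the numbers.
static float copy_bandwidth_gbs(void* dst, const void* src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / 1e9f) / (ms / 1e3f);
}

int main() {
    const size_t bytes = 256ull << 20;   // 256 MiB, illustrative
    void* d; cudaMalloc(&d, bytes);

    void* pageable = malloc(bytes);                 // ordinary host memory
    void* pinned;  cudaMallocHost(&pinned, bytes);  // page-locked host memory

    printf("pageable: %.1f GB/s\n", copy_bandwidth_gbs(d, pageable, bytes));
    printf("pinned:   %.1f GB/s\n", copy_bandwidth_gbs(d, pinned, bytes));

    cudaFree(d); cudaFreeHost(pinned); free(pageable);
    return 0;
}

On a typical PCIe 3.0 x16 machine the pinned figure lands around 12 GB/s, an order of magnitude below on-device memory bandwidth, which is the gap that NVLink-class interconnects are meant to close.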