A Case for Work-stealing on FPGAs with OpenCL Atomics

Ramanathan, Nadesh; Wickerson, John; Winterstein, Felix; Constantinides, George A.

doi:10.1145/2847263.2847343

Cited by 30 publications

(9 citation statements)

References 17 publications

“…As a result, underutilized PEs stealing the workload from the overloaded PEs and writing the results back to their buffers after the calculation will not payoff [14]. In addition, heavy operations (e.g., atomic operation) will stall the processing pipeline, resulting in new system bottlenecks [11]. Challenge 2: How to minimize manual efforts for skew handling?…”

Section: Challenges and Solutions (mentioning)

confidence: 99%

“…Since PEs process distinctive ranges of data, skew datasets may cause some PEs overloaded or underutilized, which essentially diminishes performance. The challenge of skew handling for data-intensive applications is that the lightweight computation (e.g., the calculation with integers finished within one cycle) cannot tolerate any heavy workload rebalancing operations such as atomic-based work-stealing [11]. Besides, skew handling needs to adapt to very different data distributions in a robust manner and requires sizable hardware expertise in general; therefore, the other challenge is to minimize the manual development efforts for developers.…”

Section: Introduction (mentioning)

confidence: 99%

Skew-Oblivious Data Routing for Data-Intensive Applications on FPGAs with HLS

Chen, Tan, Chen et al. 2021

Preprint

FPGAs have become emerging computing infrastructures for accelerating applications in datacenters. Meanwhile, high-level synthesis (HLS) tools have been proposed to ease the programming of FPGAs. Even with HLS, irregular data-intensive applications require explicit optimizations, among which multiple processing elements (PEs) with each owning a private BRAM-based buffer are usually adopted to process multiple data per cycle. Data routing, which dynamically dispatches multiple data to designated PEs, avoids data replication in buffers compared to statically assigning data to PEs, hence saving BRAM usage. However, the workload imbalance among PEs vastly diminishes performance when processing skew datasets. In this paper, we propose a skew-oblivious data routing architecture that allocates secondary PEs and schedules them to share the workload of the overloaded PEs at run-time. In addition, we integrate the proposed architecture into a framework called Ditto to minimize the development efforts for applications that require skew handling. We evaluate Ditto on five commonly used applications: histogram building, data partitioning, pagerank, heavy hitter detection and hyperloglog. The results demonstrate that the generated implementations are robust to skew datasets and outperform the state-of-the-art designs in both throughput and BRAM usage efficiency.

Footnote 1: Ditto, a Transform Pokémon, which is able to reconstitute entire cellular structure to change into what it sees.

“…Neither of these works support the explicit multi-threading constructs defined by the Pthreads standard, so a direct comparison with the present work is difficult. Altera's SDK for OpenCL [3] supports lock-free programming via atomics [26], though the commercial nature of the tool makes it difficult to ascertain exactly how these operations are implemented. LEAP facilitates parallel memory access through its provision of memory hierarchies that potentially can be shared among Pthreads in a lock-free manner [32].…”

Section: High-level Synthesis (mentioning)

confidence: 99%

Hardware Synthesis of Weakly Consistent C Concurrency

Ramanathan, Fleming, Wickerson et al. 2017

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Self Cite

Lock-free algorithms, in which threads synchronise not via coarse-grained mutual exclusion but via fine-grained atomic operations ('atomics'), have been shown empirically to be the fastest class of multi-threaded algorithms in the realm of conventional processors. This paper explores how these algorithms can be compiled from C to reconfigurable hardware via high-level synthesis (HLS). We focus on the scheduling problem, in which software instructions are assigned to hardware clock cycles. We first show that typical HLS scheduling constraints are insufficient to implement atomics, because they permit some instruction reorderings that, though sound in a single-threaded context, demonstrably cause erroneous results when synthesising multi-threaded programs. We then show that correct behaviour can be restored by imposing additional intra-thread constraints among the memory operations. We implement our approach in the open-source LegUp HLS framework, and provide both sequentially consistent (SC) and weakly consistent ('weak') atomics. Weak atomics necessitate fewer constraints than SC atomics, but suffice for many concurrent algorithms. We confirm, via automatic model-checking, that we correctly implement the semantics defined by the 2011 revision of the C standard. A case study on a circular buffer suggests that circuits synthesised from programs that use atomics can be 2.5x faster than those that use locks, and that weak atomics can yield a further 1.5x speedup.

Keywords: atomic operations, C/C++, FPGAs, high-level synthesis, lock-free algorithms, memory consistency models, scheduling.

“…However, recent devices, including Intel's Xeon+FPGA system [Intel 2019; Oliver et al 2011], the IBM CAPI [Stuecheli et al 2015] and the Xilinx Alveo [Xilinx 2018], offer a fine-grained shared-memory interface between the CPU and FPGA. This enables synchronisation idioms where data is exchanged in arbitrary (potentially small) amounts, such as work stealing, which has been shown to enable significant speedups in difficult-to-accelerate applications [e.g., Farooqui et al 2016; Ramanathan et al 2016; Tzeng et al 2010].…”

Section: Introduction (mentioning)

confidence: 99%

The semantics of shared memory in Intel CPU/FPGA systems

Iorga, Donaldson, Sorensen et al. 2021

Proc. ACM Program. Lang.

Self Cite

Heterogeneous CPU/FPGA devices, in which a CPU and an FPGA can execute together while sharing memory, are becoming popular in several computing sectors. In this paper, we study the shared-memory semantics of these devices, with a view to providing a firm foundation for reasoning about the programs that run on them. Our focus is on Intel platforms that combine an Intel FPGA with a multicore Xeon CPU. We describe the weak-memory behaviours that are allowed (and observable) on these devices when CPU threads and an FPGA thread access common memory locations in a fine-grained manner through multiple channels. Some of these behaviours are familiar from well-studied CPU and GPU concurrency; others are weaker still. We encode these behaviours in two formal memory models: one operational, one axiomatic. We develop executable implementations of both models, using the CBMC bounded model-checking tool for our operational model and the Alloy modelling language for our axiomatic model. Using these, we cross-check our models against each other via a translator that converts Alloy-generated executions into queries for the CBMC model. We also validate our models against actual hardware by translating 583 Alloy-generated executions into litmus tests that we run on CPU/FPGA devices; when doing this, we avoid the prohibitive cost of synthesising a hardware design per litmus test by creating our own 'litmus-test processor' in hardware. We expect that our models will be useful for low-level programmers, compiler writers, and designers of analysis tools. Indeed, as a demonstration of the utility of our work, we use our operational model to reason about a producer/consumer buffer implemented across the CPU and the FPGA. When the buffer uses insufficient synchronisation -- a situation that our model is able to detect -- we observe that its performance improves at the cost of occasional data corruption.
