An adaptive concurrent priority queue for NUMA architectures

Strati, Foteini; Giannoula, Christina; Siakavaras, Dimitrios; Goumas, Georgios; Koziris, Nectarios

doi:10.1145/3310273.3323164

Cited by 8 publications

(5 citation statements)

References 38 publications


“…Even with fine-grained parallel retrieve data transfers at rank granularity, the amount of padding needed in the equally-wide and variable-sized schemes is at 88.6% and 88.0%, respectively, causing high bottlenecks in the narrow memory bus. Therefore, in PIM systems that do not support very fine-grained parallel transfers to gather results from PIM-enabled memory to the host CPU at DRAM bank granularity, execution is highly limited by the amount of padding performed in retrieve data transfers, which can be very large in irregular workloads [22,56,60,63,80,82,83,94,104,121,152,167,194,201,225,229,249] like the SpMV kernel.…”

Section: Observation 12 (mentioning)

confidence: 99%

SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

Giannoula, Fernandez, Gómez-Luna et al. 2022

Preprint

Self Cite


Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures, after decades of research efforts. Near-bank PIM architectures place simple cores close to DRAM banks. Recent research demonstrates that they can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, thereby being a good fit to accelerate the Sparse Matrix Vector Multiplication (SpMV) kernel. SpMV has been characterized as one of the most significant and thoroughly studied scientific computation kernels. It is primarily a memory-bound kernel with intensive memory accesses due to its algorithmic nature, the compressed matrix format used, and the sparsity patterns of the input matrices given. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make three key contributions. First, we implement a wide variety of software strategies on SpMV for a multithreaded PIM core, including (1) various compressed matrix formats, (2) load balancing schemes across parallel threads and (3) synchronization approaches, and characterize the computational limits of a single multithreaded PIM core. Second, we design various load balancing schemes across multiple PIM cores, and two types of data partitioning techniques to execute SpMV on thousands of PIM cores: (1) 1D-partitioned kernels to perform the complete SpMV computation only using PIM cores, and (2) 2D-partitioned kernels to strike a balance between computation and data transfer costs to PIM-enabled memory. Third, we compare SpMV execution on a real-world PIM system with 2528 PIM cores to an Intel Xeon CPU and an NVIDIA Tesla V100 GPU to study the performance and energy efficiency of various devices, i.e., both memory-centric PIM systems and conventional processor-centric CPU/GPU systems, for the SpMV kernel. The SparseP software package provides 25 SpMV kernels for real PIM systems supporting the four most widely used compressed matrix formats, i.e., CSR, COO, BCSR and BCOO, and a wide range of data types. SparseP is publicly and freely available at https://github.com/CMU-SAFARI/SparseP. Our extensive evaluation using 26 matrices with various sparsity patterns provides new insights and recommendations for software designers and hardware architects to efficiently accelerate the SpMV kernel on real PIM systems.
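For readers unfamiliar with the CSR format named in the abstract, the following is a minimal single-threaded sketch of SpMV over a CSR matrix. It is an illustration only and is not taken from the SparseP library; the function name `spmv_csr` and the array names `row_ptr`, `col_idx` and `values` are assumptions chosen for this example. The indirect access `x[col_idx[k]]` is the kind of irregular memory access that makes the kernel memory-bound.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format.

    Hypothetical helper for illustration; not part of SparseP.
    row_ptr has one entry per row plus one, col_idx and values
    have one entry per nonzero.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # The nonzeros of this row occupy positions
        # row_ptr[row] .. row_ptr[row + 1] - 1.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect access into x through the column index.
            y[row] += values[k] * x[col_idx[k]]
    return y


# Example matrix (3x3):
# [[4, 0, 1],
#  [0, 0, 2],
#  [0, 3, 0]]
row_ptr = [0, 2, 3, 4]
col_idx = [0, 2, 2, 1]
values = [4.0, 1.0, 2.0, 3.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

Rows with different numbers of nonzeros do different amounts of work, which is why the abstract lists load balancing across threads and cores as a separate design dimension.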


“…Implementing concurrent queues is a widely studied topic [2,3,6,9,18,24,29,31]. Below we focus on the most relevant works.…”

Section: Related Work (mentioning)

confidence: 99%

Jiffy: A Fast, Memory Efficient, Wait-Free Multi-Producers Single-Consumer Queue

Adas, Friedman 2020

Preprint


In applications such as sharded data processing systems, sharded in-memory key-value stores, data flow programming and load sharing applications, multiple concurrent data producers are feeding requests into the same data consumer. This can be naturally realized through concurrent queues, where each consumer pulls its tasks from its dedicated queue. For scalability, wait-free queues are often preferred over lock-based structures. The vast majority of wait-free queue implementations, and even lock-free ones, support the multi-producer multi-consumer model. Yet, this comes at a premium, since implementing wait-free multi-producer multi-consumer queues requires utilizing complex helper data structures. The latter increases the memory consumption of such queues and limits their performance and scalability. Additionally, many such designs employ (hardware) cache-unfriendly memory access patterns. In this work we study the implementation of wait-free multi-producer single-consumer queues. Specifically, we propose Jiffy, an efficient, memory-frugal, novel wait-free multi-producer single-consumer queue and formally prove its correctness. We then compare the performance and memory requirements of Jiffy with other state-of-the-art lock-free and wait-free queues. We show that indeed Jiffy can maintain good performance with up to 128 threads, delivers up to 50% better throughput than the next best construction we compared against, and consumes ≈90% less memory.
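The multi-producer single-consumer pattern described in the abstract can be sketched as follows. This is not Jiffy and it is not wait-free: it uses Python's standard `queue.Queue`, which is lock-based, purely to show the usage pattern of several producers feeding one dedicated consumer queue. The names `producer` and `run_mpsc` are hypothetical.

```python
import queue
import threading


def producer(q, producer_id, n_items):
    """Push n_items tagged requests into the shared queue."""
    for i in range(n_items):
        q.put((producer_id, i))


def run_mpsc(n_producers=4, n_items=100):
    """Run several producers against one consumer (the calling thread).

    Illustration only: queue.Queue is lock-based, unlike the
    wait-free design the abstract proposes.
    """
    q = queue.Queue()
    threads = [
        threading.Thread(target=producer, args=(q, p, n_items))
        for p in range(n_producers)
    ]
    for t in threads:
        t.start()

    consumed = []
    # The single consumer drains exactly as many items as were produced.
    for _ in range(n_producers * n_items):
        consumed.append(q.get())

    for t in threads:
        t.join()
    return consumed


consumed = run_mpsc()
```

Because the queue is FIFO and each producer enqueues sequentially, the consumer sees each producer's items in the order they were produced, while items from different producers may interleave arbitrarily.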


“…Graph coloring assigns colors to the vertices of a graph such that any two adjacent vertices have different colors. Graph coloring kernel is widely used in many important real-world applications including the conflicting job scheduling [1][2][3][4][5], register allocation [6][7][8][9][10], sparse linear algebra [11][12][13][14], machine learning (e.g., to select non-similar samples that form an effective training set), and chromatic scheduling of graph processing applications [15][16][17][18]. For instance, the chromatic scheduling execution is as follows: given the vertex coloring of a graph, chromatic scheduling performs N steps that are executed serially, where N is the number of colors used to color the graph, and at each step the vertices assigned to the same color are processed in parallel, i.e., representing independent tasks that are executed concurrently.…”

Section: Introduction (mentioning)

confidence: 99%
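The chromatic scheduling execution described in the statement above can be sketched as follows. This is a minimal illustration under stated assumptions: the coloring comes from a plain sequential greedy pass (not the algorithm of the citing paper), and the helper names `greedy_coloring` and `chromatic_schedule`, the adjacency-dict graph representation, and the per-vertex `task` callback are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def greedy_coloring(adj):
    """Sequential greedy coloring: give each vertex the smallest
    color not already used by one of its colored neighbors."""
    colors = {}
    for v in sorted(adj):
        used = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors


def chromatic_schedule(colors, task):
    """Run task(v) for every vertex in N serial steps, one per color.

    Vertices of the same color are never adjacent, so within a step
    they are submitted to the thread pool as independent tasks.
    """
    classes = {}
    for v, c in colors.items():
        classes.setdefault(c, []).append(v)

    results = {}
    with ThreadPoolExecutor() as pool:
        for c in sorted(classes):  # steps are executed serially
            # pool.map returns only once the whole step has finished.
            for v, r in zip(classes[c], pool.map(task, classes[c])):
                results[v] = r
    return results, len(classes)


# Small undirected graph given as adjacency lists.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [2]}
colors = greedy_coloring(adj)
results, n_steps = chromatic_schedule(colors, lambda v: v * v)
```

The number of steps equals the number of colors used, and the size of each color class bounds the parallelism available within its step.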

“…Fig 17. Speedup achieved by all parallel graph coloring implementations over the sequential Greedy scheme in large real-world graphs using the maximum hardware thread capacity of an Intel Broadwell server with hyperthreading enabled (88 threads)…”

mentioning

confidence: 99%

High-performance and balanced parallel graph coloring on multicore platforms

et al. 2022

Self Cite


Graph coloring is widely used to parallelize scientific applications by identifying subsets of independent tasks that can be executed simultaneously. Graph coloring assigns colors to the vertices of a graph, such that no adjacent vertices have the same color. The number of colors used corresponds to the number of parallel steps in a real-world end-application. Therefore, the total runtime of the graph coloring kernel adds to the overall parallel overhead of the real-world end-application, whereas the number of the vertices of each color class determines the number of the independent concurrent tasks of each parallel step, thus affecting the amount of parallelism and hardware resource utilization in the execution of the real-world end-application. In this work, we propose a high-performance graph coloring algorithm, named ColorTM, that leverages Hardware Transactional Memory (HTM) to detect coloring inconsistencies between adjacent vertices. ColorTM detects and resolves coloring inconsistencies between adjacent vertices with an eager approach to minimize data access costs, and implements a speculative synchronization scheme to minimize synchronization costs and increase parallelism. We extend our proposed algorithmic design to propose a balanced graph coloring algorithm, named BalColorTM, with which all color classes include almost the same number of vertices to achieve high parallelism and resource utilization in the execution of the real-world end-applications. We evaluate ColorTM and BalColorTM using a wide variety of large real-world graphs with diverse characteristics. ColorTM and BalColorTM improve performance by 12.98× and 1.78× on average using 56 parallel threads compared to prior state-of-the-art approaches. Moreover, we study the impact of our proposed graph coloring algorithmic designs on a popular end-application, i.e., Community Detection, and demonstrate that ColorTM and BalColorTM can provide high performance improvements in real-world end-applications across various input data given.
