Energy-Efficient Stream Compaction Through Filtering and Coalescing Accesses in GPGPU Memory Partitions

Segura, Albert; Arnau, José-María; González, Antonio

doi:10.1109/tc.2021.3104749

Cited by 4 publications

(4 citation statements)

References 26 publications

“…Graphicionado [23] is the first ASIC-based one and can reduce random memory accesses. ISCU [49,50] designs a compacting and filtering technique to prepare data for SMs of GPU for higher GPU utilization. However, when serving the streaming graph processing on CPU, ISCU not only suffers from serious redundant computation overhead, but also needs to issue multiple accesses for the compacting/filtering of multiple data elements that reside in multiple cache lines.…”

Section: Additional Related Work (mentioning)

confidence: 99%

TDGraph

Zhao, Yang, Liao, et al. 2022

Proceedings of the 49th Annual International Symposium on Computer Architecture

Many solutions have recently been proposed to support the processing of streaming graphs. However, when each snapshot of a streaming graph is processed, the new states of the vertices affected by graph updates propagate irregularly along the graph topology. Despite years of research effort, existing approaches still suffer from redundant computation overhead and irregular memory access, which severely underutilize a many-core processor. To address these issues, this paper proposes TDGraph, a topology-driven programmable accelerator and the first accelerator to augment many-core processors for high-performance processing of streaming graphs. Specifically, we build an efficient topology-driven incremental execution approach into the accelerator design for more regular state propagation and better data locality. TDGraph takes the vertices affected by graph updates as roots, prefetches other vertices along the graph topology, and synchronizes their incremental computations on the fly. In this way, most state propagations originating from multiple vertices affected by different graph updates can be conducted together along the graph topology, which helps reduce redundant computation and data access cost. In addition, by efficiently coalescing accesses to vertex states, TDGraph further improves the utilization of the cache and memory bandwidth. We have evaluated TDGraph on a simulated 64-core processor. The results show that the state-of-the-art software system achieves a speedup of 7.1 to 21.4 times after integrating TDGraph, while incurring only 0.73% area cost. Compared with four cutting-edge accelerators, i.e., HATS, Minnow, PHI, and DepGraph, TDGraph gains speedups of 4.6 to 12.7, 3.2 to 8.6, 3.8 to 9.7, and 2.3 to 6.1 times, respectively.

“…The efficiency of accessing memory is a deciding factor in improving performance due to the special multi-threaded execution mode of GPUs. Given that a large number of threads may issue memory access requests at the same time, those requests may be delivered to the off-chip DRAM if the storage hierarchy cannot effectively deal with said mode, leading to numerous threads being blocked and unable to secure the requested data [2, 12-14]. This will bring about painful repercussions.…”

Section: Introduction (mentioning)

confidence: 99%

WSMP: A Warp Scheduling Strategy Based on MFQ and PPF

Fang, Zhao, Cai, et al. 2022

Preprint

Normally, threads in a warp do not severely interfere with each other. However, the scheduler must wait until all the threads within a warp complete before scheduling the next warp, resulting in memory divergence. The crux of the problem is scheduling the warps in a more reasonable order. Therefore, we propose a new warp scheduling strategy called WSMP, based on a Multi-level Feedback Queue (MFQ) and Perceptron-Based Prefetch Filtering (PPF), in which all warps are sorted beforehand according to their latency tolerance and pushed into a corresponding queue in the MFQ. We also adapt PPF to enhance the modified underlying prefetcher, which lets us strike a balance between cache hit rate and prefetch coverage. We verify its feasibility with GPGPU-Sim, using GPGPU-specific workloads. The results show that, compared with the baseline, WSMP improves IPC by 27.44% and reduces the L2 cache miss rate by 8.53% on average.
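The memory divergence mentioned in this abstract can be made concrete with a small sketch. It is not taken from WSMP or from the cited paper; it only assumes a 32-thread warp and a 128-byte cache line, and the function name `transactions_per_warp` is hypothetical. It counts how many distinct cache lines one warp-wide load touches, which is a rough proxy for the number of memory transactions issued.

```python
def transactions_per_warp(addresses, line_bytes=128):
    """Count distinct cache lines touched by one warp-wide access.

    `addresses` holds one byte address per thread. The 128-byte line
    size is an assumption for illustration, not a figure from the source.
    """
    return len({addr // line_bytes for addr in addresses})


# 32 threads reading consecutive 4-byte words: all fall in one line.
coalesced = [4 * t for t in range(32)]

# 32 threads reading words 4096 bytes apart: every thread hits its own line.
divergent = [4096 * t for t in range(32)]

print(transactions_per_warp(coalesced))  # 1
print(transactions_per_warp(divergent))  # 32
```

In this toy model the divergent pattern needs 32 times as many line fetches for the same amount of useful data, which is the inefficiency that reordering and coalescing techniques aim to reduce.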

“…In conclusion, the ISCU leverages the strengths of our previous work on improved graph processing by offloading stream compaction operations, and our work on improved irregular accesses on GPGPU architectures which deliver synergistic improvements in efficient graph processing. This work has been submitted for publication [138].…”

Section: Contribution (mentioning)

confidence: 99%

“…Our design achieves high energy savings and important speedups for graph processing in modern GPGPU architectures as explored in Chapter 6. This work has been submitted for publication [138].…”

Section: Graph Processing Algorithms on GPGPU Architectures (mentioning)

confidence: 99%

High-performance and energy-efficient irregular graph processing on GPU architectures

Segura Salvador

Graph processing is an established and prominent domain that is the foundation of new emerging applications in areas such as Data Analytics and Machine Learning, empowering applications such as road navigation, social networks and automatic speech recognition. The large amount of data employed in these domains requires high throughput architectures such as GPGPU. Although the processing of large graph-based workloads exhibits a high degree of parallelism, memory access patterns tend to be highly irregular, leading to poor efficiency due to memory divergence. In order to ameliorate these issues, GPGPU graph applications perform stream compaction operations which process active nodes/edges so subsequent steps work on a compacted dataset. We propose to offload this task to the Stream Compaction Unit (SCU) hardware extension tailored to the requirements of these operations, which additionally performs pre-processing by filtering and reordering elements processed. We show that memory divergence inefficiencies prevail in GPGPU irregular graph-based applications, yet we find that it is possible to relax the strict relationship between thread and processed data to empower new optimizations. As such, we propose the Irregular accesses Reorder Unit (IRU), a novel hardware extension integrated in the GPU pipeline that reorders and filters data processed by the threads on irregular accesses improving memory coalescing. Finally, we leverage the strengths of both previous approaches to achieve synergistic improvements. We do so by proposing the IRU-enhanced SCU (ISCU), which employs the efficient pre-processing mechanisms of the IRU to improve SCU stream compaction efficiency and NoC throughput limitations due to SCU pre-processing operations. We evaluate the ISCU with state-of-the-art graph-based applications achieving a 2.2x performance improvement and 10x energy-efficiency.
Graph processing is a prominent and established domain that forms the basis of new emerging applications in areas such as data analytics and Machine Learning, enabling applications such as road navigation, social networks, and automatic speech recognition. The large amount of data used in these domains requires high-performance architectures such as GPGPU. Although the processing of large graph-based workloads exhibits a high degree of parallelism, memory access patterns tend to be irregular, which reduces efficiency due to memory access divergence. To mitigate these problems, GPGPU graph applications perform stream compaction operations that process nodes/edges so that subsequent steps work on a compacted dataset. We propose offloading this task to the Stream Compaction Unit (SCU) hardware extension, tailored to the requirements of these operations, which additionally performs pre-processing by filtering and reordering the processed elements. We show that memory divergence inefficiencies prevail in GPGPU applications based on irregular graphs, yet we find that it is possible to relax the strict relationship between threads and the data they process to obtain new optimizations. As such, we propose the Irregular accesses Reorder Unit (IRU), a new hardware extension integrated into the GPU pipeline that reorders and filters the data processed by threads on irregular accesses, improving the convergence of memory accesses. Finally, we leverage the strengths of the previous proposals to achieve synergistic improvements. We do so by proposing the IRU-enhanced SCU (ISCU), which uses the efficient pre-processing mechanisms of the IRU to improve the stream compaction efficiency of the SCU and the NoC throughput limitations caused by the SCU pre-processing operations.
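The stream compaction operation described in this abstract can be sketched in software as follows. This is a sequential illustration under stated assumptions, not the SCU or ISCU hardware design: the function names `stream_compact` and `filter_and_reorder` and the `elems_per_line` parameter are hypothetical. The first function keeps only active elements using an exclusive prefix sum to compute output slots, which is the usual way the operation is parallelized on GPUs. The second shows one simple form of filtering (dropping duplicates) and reordering (grouping indices that share a memory line).

```python
from itertools import accumulate


def stream_compact(values, active):
    """Return the values whose flag is set, preserving their order."""
    flags = [1 if a else 0 for a in active]
    # Exclusive prefix sum: offsets[i] is the output slot of element i
    # if it is active; offsets[-1] is the total number of active elements.
    offsets = list(accumulate(flags, initial=0))
    out = [None] * offsets[-1]
    for i, flag in enumerate(flags):
        if flag:
            out[offsets[i]] = values[i]
    return out


def filter_and_reorder(indices, elems_per_line=32):
    """Drop duplicate indices, then group indices sharing a memory line.

    `elems_per_line` (elements per cache line) is an assumed parameter.
    """
    unique = list(dict.fromkeys(indices))  # keeps first occurrence
    # sorted() is stable, so order within a line is preserved.
    return sorted(unique, key=lambda i: i // elems_per_line)


nodes = [10, 11, 12, 13, 14]
active = [0, 1, 1, 0, 1]
print(stream_compact(nodes, active))               # [11, 12, 14]
print(filter_and_reorder([70, 3, 70, 40, 5, 33]))  # [3, 5, 40, 33, 70]
```

After compaction, later steps iterate only over the active nodes, and after grouping, neighbouring threads are more likely to touch the same memory line.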
