2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca.2019.00051
Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads

Cited by 66 publications (29 citation statements)
References 46 publications
“…Across a complete set of 29 workloads, we show a significant average speedup of 2.6× and energy savings of 1.6× compared to a non-prefetching baseline. Using our evaluation framework, we further show that Prodigy outperforms IMP [99], Ainsworth and Jones' prefetcher [6], and DROPLET [15] by 2.3×, 1.5×, and 1.6×, respectively. The compact DIG representation allows Prodigy to achieve high speedups at a mere 0.8KB of hardware storage overhead.…”
Section: Introduction
confidence: 83%
“…Hardware prefetchers rely on capturing memory access patterns using explicit programmer support [5], [6], learning techniques [77], and intelligent hardware structures [99]. Limitations of these approaches include their limited applicability to a subset of data structures and indirect memory access patterns [6], [15], [99] or high complexity and hardware cost to support generalization [5], [77]. While software prefetching [7] can exploit static semantic view of algorithms, it lacks dynamic run-time information and struggles to maintain prefetch timeliness.…”
Section: Introduction
confidence: 99%
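To make the indirect-access limitation concrete, the following is a minimal C sketch, not taken from any of the cited papers; the array names and the PF_DIST look-ahead are illustrative assumptions. It shows the A[B[i]]-style pattern that stride-based hardware prefetchers miss, and a software-prefetch variant whose benefit hinges on choosing the look-ahead distance well, which is exactly the timeliness problem the statement raises for software prefetching [7].

```c
/* Illustrative sketch only (not from the cited papers): the indirect access
 * pattern A[B[i]] and a software-prefetch variant. PF_DIST is a hypothetical
 * tuning knob; picking it well is the "timeliness" issue mentioned above. */
#include <stddef.h>
#include <stdint.h>

#define PF_DIST 16  /* assumed prefetch look-ahead, in elements */

/* Sum vertex_data over a neighbor list: the load of neighbors[i] must
 * complete before the dependent, irregular load of vertex_data[...] can issue. */
double sum_neighbors(const uint32_t *neighbors, size_t n,
                     const double *vertex_data)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n) {
            /* Prefetch the irregular target a few iterations ahead
             * (GCC/Clang builtin; read access, low temporal locality). */
            __builtin_prefetch(&vertex_data[neighbors[i + PF_DIST]], 0, 1);
        }
        acc += vertex_data[neighbors[i]];  /* indirect, cache-unfriendly load */
    }
    return acc;
}
```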
“…In Aggregation phase, the Computation Unit Utilization is only 50% and the Executed IPC is only 1.78 on average as shown in Table 3. The aggregation heavily relies on the graph structure so that it is obstructed by irregularity [8] and load-load data dependency chain [11]. Therefore, it is mainly stalled for Data Request and Execution Dependency as depicted in Fig.…”
Section: Analysis Of Overall Execution
confidence: 99%
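For context, here is a minimal C sketch of such an aggregation loop over a CSR graph; it is illustrative only, and the kernel and array names are assumptions rather than the cited system's code. The chain row_ptr[v] -> col_idx[e] -> features[u] makes each load's address depend on the value returned by the previous load, which is the load-load dependency chain and graph-structure irregularity that the statement identifies as the source of Data Request and Execution Dependency stalls.

```c
/* Illustrative sketch (assumed CSR layout, not the cited system): sum-style
 * feature aggregation. Each neighbor's feature address depends on col_idx[e],
 * which in turn depends on row_ptr[v], so the loads form a dependency chain
 * with irregular, cache-unfriendly targets. */
#include <stddef.h>
#include <stdint.h>

void aggregate(size_t num_vertices, int feat_dim,
               const uint32_t *row_ptr,   /* CSR offsets, length num_vertices+1 */
               const uint32_t *col_idx,   /* CSR neighbor ids                   */
               const float *features,     /* num_vertices x feat_dim            */
               float *out)                /* num_vertices x feat_dim            */
{
    for (size_t v = 0; v < num_vertices; v++) {
        for (int d = 0; d < feat_dim; d++)
            out[v * feat_dim + d] = 0.0f;

        /* load 1: neighbor range of v */
        for (uint32_t e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
            uint32_t u = col_idx[e];                   /* load 2: neighbor id   */
            const float *fu = &features[(size_t)u * (size_t)feat_dim];
            for (int d = 0; d < feat_dim; d++)
                out[v * feat_dim + d] += fu[d];        /* load 3: depends on u  */
        }
    }
}
```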