Evaluation and Optimization of Breadth-First Search on NUMA Cluster

Cui, Zehan; Chen, Licheng; Chen, Mingyu; Bao, Yungang; Huang, Yongbing; Lv, Huiwei

doi:10.1109/cluster.2012.29

Cited by 6 publications

(3 citation statements)

References 43 publications

“…As shown in Figure 4, the Matrix-2000+ CPUs adopt a regional autonomous parallel architecture composed of several regions. Each region can be viewed as a functionally-independent SN, which has SVE (Scalable Vector Extension) configured in hardware that can be used to accelerate BFS [12][13][14][15][16][17]. Rather than using a fixed vector length, SVE allows Matrix-2000+ to choose the most appropriate vector length for applications, ranging from 128 bits up to 1024 bits per vector register file.…”

Section: BFS With SVE (mentioning)

confidence: 99%

TianheGraph: Customizing Graph Search for Graph500 on Tianhe Supercomputer

Gan

Zhang

Wang

et al. 2022

IEEE Trans. Parallel Distrib. Syst.

As the era of exascale supercomputing is coming, it is vital for next-generation supercomputers to find appropriate applications with high social and economic benefit. In recent years, it has been widely accepted that extremely-large graph computation is a promising killer application for supercomputing. Although Tianhe series supercomputers are leading in the world-wide competition of supercomputing (ranked No. 1 in the Top500 list six times), previously they had been inefficient in graph computation according to the Graph500 list. This is mainly because the previous graph processing system cannot leverage the advanced hardware features of Tianhe supercomputers. To address the problem, in this paper we present our integrated optimizations for improving the graph computation performance on our next-generation exascale Tianhe supercomputing system, mainly including sorting with buffering for heavy vertices, vectorized searching with SVE (Scalable Vector Extension) on Matrix-2000+ CPUs, and group-based monitor communication on the proprietary interconnection network. Performance evaluation on a subset of the Tianhe exascale supercomputer (with 512 nodes and 96608 cores) shows that our customized graph processing system achieves 2131.98 GTEPS, which even outperforms the Tianhe-2 supercomputer (ranked No. 7 in Graph500 by running the state-of-the-art graph processing system) that has 16x more computing nodes.

“…Note that our framework is a generic graph processing framework that enables users to develop multiple applications, including BFS and SSSP, and applies optimizations in an application-agnostic way. While, the codes we compete against in Graph500 are developed for these specific applications (as published in the corresponding publications [13,38,40,41]).…”

Section: Graph500 Submissions (mentioning)

confidence: 99%

Scale-Free Graph Processing on a NUMA Machine

Aasawat

Reza

Ripeanu

2018

2018 IEEE/ACM 8th Workshop on Irregular Applications: Architectures and Algorithms (IA3)

The importance of high-performance graph processing to solve big data problems targeting high-impact applications is greater than ever before. Graphs incur highly irregular memory accesses, which lead to poor data locality, load imbalance, and data-dependent parallelism. Distributed graph processing frameworks, such as Google's Pregel, that employ memory-parallel, shared-nothing systems have experienced tremendous success in terms of scale and performance. Modern shared-memory systems embrace the so-called Non-Uniform Memory Access (NUMA) architecture, which has proven to be more scalable (in terms of numbers of cores and memory modules) than the Symmetric Multiprocessing (SMP) architecture. In many ways, a NUMA system resembles a shared-nothing distributed system: physically distinct processing cores and memory regions (although cache-coherent in NUMA). Memory accesses to remote NUMA domains are more expensive than local accesses. This poses the opportunity to transfer the know-how and design of distributed graph processing to develop shared-memory graph processing solutions optimized for NUMA systems (which is surprisingly little-explored). In this dissertation, we explore whether a distributed-memory-like middleware that makes graph partitioning and communication between partitions explicit can improve performance on a NUMA system. We design and implement a NUMA-aware graph processing framework that treats the NUMA platform as a distributed system and embraces its design principles, in particular explicit partitioning and inter-partition communication. We further explore design trade-offs to reduce communication overhead and propose a solution that embraces the design philosophies of distributed graph processing systems and at the same time exploits optimization opportunities specific to single-node systems. We demonstrate up to 13.9× speedup over a state-of-the-art NUMA-aware framework, Polymer, and up to 3.7× scalability on a four-socket machine using graphs with tens of billions of edges.

Preface: This thesis is based on the research project done by me under the supervision and guidance of Professor Matei Ripeanu. I was responsible for the design, implementation, modeling, validation, evaluation and analysis of the results, along with taking the lead in the publication writing effort. The research presented in this thesis has been either published or accepted for publication. The work that this thesis extends and evaluates against was selected based on the following preliminary study; Professor Ripeanu and Tahsin helped me in the analysis of the results and editing the publication.

“…Several studies focused on the performance on shared-memory nodes, for example minimizing the memory footprint of frequently accessed data (e.g. using bitmaps) (Agarwal et al, 2010;Checconi et al, 2012), or reducing intrasocket communication (Cui et al, 2012). Other work has been done on distributed BFS managed partitioning as a way to control load balance and communication (Chow et al, 2005;Yoo et al, 2005) or adopting sparse linear algebra representations to reduce the storage requirements (Gilbert et al, 2007).…”

Section: Related Work (mentioning)

confidence: 99%

Reducing communication in parallel graph search algorithms with software caches

Cicotti

Shantharam

Carrington

2018

The International Journal of High Performance Computing Applica

In many scientific and computational domains, graphs are used to represent and analyze data. Such graphs often exhibit the characteristics of small-world networks: few high-degree vertexes connect many low-degree vertexes. Despite the randomness in a graph search, it is possible to capitalize on the characteristics of small-world networks and cache relevant information of high-degree vertexes. We applied this idea by caching remote vertex ids in a parallel breadth-first search benchmark. Our experiment with different implementations demonstrated significant performance improvements over the reference implementation in several configurations, using 64 to 1024 cores. We proposed a system design in which resources are dedicated exclusively to caching and shared among a set of nodes. Our evaluation demonstrates that this design reduces communication and has the potential to improve performance on large-scale systems in which the communication cost increases significantly with the distance between nodes. We also tested a memcached system as the cache server finding that its generic protocol, which does not match our usage semantics, hinders significantly the potential performance improvements and suggested that a generic system should also support a basic and lightweight communication protocol to meet the needs of high-performance computing applications. Finally, we explored different configurations to find efficient ways to utilize the resources allocated to solve a given problem size; to this extent, we found utilizing half of the compute cores per allocated node improves performance, and even in this case, caching variants always outperform the reference implementation.
