The Graph Convolutional Network (GCN) model and its variants are powerful graph embedding tools for facilitating classification and clustering on graphs. However, a major challenge is to reduce the complexity of layered GCNs and make them parallelizable and scalable on very large graphs -state-of the art techniques are unable to achieve scalability without losing accuracy and efficiency. In this paper, we propose novel parallelization techniques for graph samplingbased GCNs that achieve superior scalable performance on very large graphs without compromising accuracy. Specifically, our GCN guarantees work-efficient training and produces order of magnitude savings in computation and communication. To scale GCN training on tightly-coupled shared memory systems, we develop parallelization strategies for the key steps in training: For the graph sampling step, we exploit parallelism within and across multiple sampling instances, and devise an efficient data structure for concurrent accesses that provides theoretical guarantee of near-linear speedup with number of processing units. For the feature propagation step within the sampled graph, we improve cache utilization and reduce DRAM communication by data partitioning. We prove that our partitioning strategy is a 2approximation for minimizing the communication time compared to the optimal strategy. We demonstrate that our parallel graph embedding outperforms state-of-the-art methods in scalability (with respect to number of processors, graph size and GCN model size), efficiency and accuracy on several large datasets. On a 40-core Xeon platform, our parallel training achieves 64× speedup (with AVX) in the sampling step and 25× speedup in the feature propagation step, compared to the serial implementation, resulting in a net speedup of 21×. Our scalable algorithm enables deeper GCN, as demonstrated by 1306× speedup on a 3-layer GCN compared to Tensorflow implementation of state-of-the-art.
Anytime you work with oranges and apples, you'll need a number of schemes to organize total performance. This article surveys the challenges posed by H omogeneous computing, which uses one or more machines of the same type, has provided adequate performance for many applications in the past. Many of these applications had more than one type of embedded parallelism, such as single instruction, multiple data (SIMD) and multiple instruction, multiple data (MIMD). Most of the current parallel machines are suited only for homogeneous computing. However, numerous applications that have more than one type of embedded parallelism are now being considered for parallel implementation. On the other hand, as the amount of homogeneous parallelism in applications decreases, homogeneous systems cannot offer the desired speedups. To exploit the heterogeneity in computations, researchers are investigating a suite of heterogeneous architectures.Heterogeneous computing (HC) is the well-orchestrated and coordinated effective use of a suite of diverse high-performance machines (including parallel machines) to provide superspeed processing for computationally demanding tasks with diverse computing needs.' An HC system includes heterogeneous machines, high-speed networks, interfaces, operating systems, communication protocols, and programming environments, all combining to produce a positive impact on ease of use and performance. Figure 1 shows an example HC environment.Heterogeneous computing should be distinguished from network computing or high-performance distributed computing, which have generally come to mean either clusters of workstations or ad hoc connectivity among computers using little more than opportunistic load-balancing. HC is a plausible, novel technique for heterogeneous solving computationally intensive problems that have several types of embedded computing and parallelism. HC also helps to reduce design risks by incorporating proven technology and existing designs instead of developing them from scratch. However, discusses some several issues and problems arise from employing this technique, which we discuss.In the past few years, several technical meetings have addressed many of these approaches to opening issues. There is also a growing interest in using this paradigm to solve Grand up its opportunities.
Pattern matching for network security and intrusion detection demands exceptionally high performance. Much work has been done in this field, and yet there is still significant room for improvement in efficiency, flexibility, and throughput. We develop a novel linear-array string matching architecture using a buffered, two-comparator variation on the Knuth-Morris-Pratt(KMP) algorithm. For small (16 or fewer characters) patterns, it compares favorably with the state-of-the-art while providing better scalability and reconfiguration, and more efficient hardware utilization.KMP is a well-known, efficient string matching technique using a single comparator and a precomputed transition table. We add a second comparator and an input buffer, allowing the system to accept at least one character in each cycle and terminate after a number of clock cycles at maximum equal to the length of the input string plus the size of the buffer. The system also provides a clean, modular route to reconfiguring the patterns on-the-fly and scaling the system to support more units, using several rows of linear array elements. In this paper, we prove the bound on the buffer size and running time, and provide performance comparisons against other approaches.
Large scale graph processing is a major research area for Big Data exploration. Vertex centric programming models like Pregel are gaining traction due to their simple abstraction that allows for scalable execution on distributed systems naturally. However, there are limitations to this approach which cause vertex centric algorithms to under-perform due to poor compute to communication overhead ratio and slow convergence of iterative superstep. In this paper we introduce GoFFish a scalable sub-graph centric framework co-designed with a distributed persistent graph storage for large scale graph analytics on commodity clusters. We introduce a sub-graph centric programming abstraction that combines the scalability of a vertex centric approach with the flexibility of shared memory sub-graph computation. We map Connected Components, SSSP and PageRank algorithms to this model to illustrate its flexibility. Further, we empirically analyze GoFFish using several real world graphs and demonstrate its significant performance improvement, orders of magnitude in some cases, compared to Apache Giraph, the leading open source vertex centric implementation.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.