Graph neural networks
(GNNs) constitute a class of deep learning
methods for graph data. They have wide applications in chemistry and
biology, such as molecular property prediction, reaction prediction,
and drug–target interaction prediction. Despite this interest,
GNN-based modeling remains challenging: it requires expertise in
graph data preprocessing and graph modeling on top of general
programming and deep learning skills. Here, we
present Deep Graph Library (DGL)-LifeSci, an open-source package for
deep learning on graphs in life science. DGL-LifeSci is a Python
toolkit based on RDKit, PyTorch, and DGL. It allows GNN-based
modeling on custom datasets for
molecular property prediction, reaction prediction, and molecule generation.
With its command-line interfaces, users can perform modeling without
any background in programming or deep learning. We test the command-line
interfaces on the standard benchmarks MoleculeNet, USPTO, and ZINC.
Compared with previous implementations, DGL-LifeSci achieves a speedup
of up to 6×. For modeling flexibility, DGL-LifeSci provides
well-optimized modules for various stages of the modeling pipeline.
In addition, DGL-LifeSci provides pretrained models for reproducing
the test experiment results and applying models without training.
The code is distributed under the Apache-2.0 license and is freely
accessible at .
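To give a rough sense of what a GNN computes on a molecular graph, the following is a minimal, dependency-free sketch of one round of mean-aggregation message passing. It is a hypothetical illustration in plain Python, not the DGL-LifeSci API; the graph, feature vectors, and function name are invented for the example.

```python
# Hypothetical sketch of one GNN message-passing layer on a small
# molecular graph (NOT the DGL-LifeSci API). Nodes are atoms, edges
# are bonds; the layer updates each node by averaging its neighbors'
# feature vectors together with its own.

def message_passing_step(adjacency, features):
    """One round of mean aggregation over each node's neighborhood."""
    updated = {}
    for node, feats in features.items():
        # Collect neighbor features plus the node's own (a self-loop).
        neighborhood = [features[n] for n in adjacency[node]] + [feats]
        dim = len(feats)
        updated[node] = [
            sum(vec[i] for vec in neighborhood) / len(neighborhood)
            for i in range(dim)
        ]
    return updated

# Toy path graph standing in for ethanol's heavy atoms: C-C-O.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 0.0]}

out = message_passing_step(adjacency, features)
print(out)  # each atom's features now mix in its neighbors' features
```

Real GNN layers add learnable weight matrices and nonlinearities around this aggregation, and libraries such as DGL batch it efficiently over sparse graphs; the loop above only shows the neighborhood-averaging core.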
Graph neural networks (GNNs) have shown great success in learning from
graph-structured data. They are widely used in applications such as
recommendation, fraud detection, and search. In these domains, the graphs
are typically large, containing hundreds of millions of nodes and billions
of edges. To train GNNs at this scale, we develop DistDGL, a system for
training GNNs in a mini-batch fashion on a cluster of machines. DistDGL is
based on the Deep Graph Library (DGL), a popular GNN development framework.
DistDGL distributes the graph and its associated data (initial features and
embeddings) across the machines and uses this distribution to derive a
computational decomposition by following an owner-compute rule. DistDGL
follows a synchronous training approach and allows the ego-networks forming
the mini-batches to include non-local nodes. To minimize the overheads
associated with distributed computation, DistDGL uses a high-quality,
lightweight min-cut graph-partitioning algorithm along with multiple
balancing constraints, which reduces communication overheads and statically
balances the computation. It further reduces communication by replicating
halo nodes and by using sparse embedding updates. Together, these design
choices allow DistDGL to train high-quality models while achieving high
parallel efficiency and memory scalability. We demonstrate our optimizations
on both inductive and transductive GNN models. Our results show that DistDGL
achieves linear speedup without compromising model accuracy, requiring only
13 seconds per training epoch for a graph with 100 million nodes and 3
billion edges on a cluster of 16 machines.
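The owner-compute decomposition with halo-node replication described above can be sketched in a few lines: each machine owns the nodes assigned to it by the partitioning, and it additionally replicates the "halo" nodes, the remote neighbors of its local nodes, so that mini-batch ego-networks can be assembled with less communication. This is a hypothetical illustration, not DistDGL's implementation; the graph, the partition assignment, and the helper function are invented for the example.

```python
# Hypothetical sketch of halo-node identification under an
# owner-compute rule (not DistDGL's actual code). A partitioning maps
# each node to a machine; a machine's halo nodes are the remote
# neighbors of its locally owned nodes, which would be replicated
# locally to avoid repeated communication.

def halo_nodes(edges, partition, machine):
    """Return the remote neighbors of the nodes owned by `machine`."""
    local = {n for n, m in partition.items() if m == machine}
    halo = set()
    for u, v in edges:
        if u in local and v not in local:
            halo.add(v)
        if v in local and u not in local:
            halo.add(u)
    return halo

# A toy 6-node cycle split across two machines.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
partition = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

print(sorted(halo_nodes(edges, partition, 0)))  # machine 0's halo
```

A min-cut partitioner aims to make these halo sets small by keeping densely connected nodes on the same machine, which is why partition quality directly controls communication volume.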