Graph generative models have been extensively studied in the data mining literature. While traditional techniques are based on generating structures that adhere to a pre-decided distribution, recent techniques have shifted towards learning this distribution directly from the data. While learning-based approaches have imparted significant improvement in quality, some limitations remain to be addressed. First, learning graph distributions introduces additional computational overhead, which limits their scalability to large graph databases. Second, many techniques only learn the structure and do not address the need to also learn node and edge labels, which encode important semantic information and influence the structure itself. Third, existing techniques often incorporate domain-specific rules and lack generalizability. Fourth, the experimentation of existing techniques is not comprehensive enough due to either using weak evaluation metrics or focusing primarily on synthetic or small datasets. In this work, we develop a domain-agnostic technique called GraphGen to overcome all of these limitations. GraphGen converts graphs to sequences using minimum DFS codes. Minimum DFS codes are canonical labels and capture the graph structure precisely along with the label information. The complex joint distributions between structure and semantic labels are learned through a novel LSTM architecture. Extensive experiments on million-sized, real graph datasets show GraphGen to be 4 times faster on average than state-of-the-art techniques while being significantly better in quality across a comprehensive set of 11 different metrics. Our code is released at: https://github.com/idea-iitd/graphgen.
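To illustrate what converting a labeled graph to a sequence can look like, the following minimal sketch writes out one DFS code for a small graph. It is not taken from the GraphGen repository. It assumes the common convention in which each edge becomes a 5-tuple (discovery time of source, discovery time of target, source label, edge label, target label), and it assumes a simple, connected, undirected graph. The function name `dfs_code`, the toy graph, and its labels are hypothetical. The sketch produces the code of a single DFS traversal only; the minimum DFS code is the lexicographically smallest code over all traversals, and that minimization step is deliberately left out here.

```python
def dfs_code(adj, node_label, edge_label, start):
    """Return one DFS code (not necessarily the minimum) as a list of 5-tuples."""
    time = {start: 0}   # discovery timestamp of each visited node
    used = set()        # undirected edges already written to the code
    code = []

    def emit(u, w):
        # Record edge (u, w) as (t_u, t_w, label_u, label_edge, label_w).
        e = frozenset((u, w))
        used.add(e)
        code.append((time[u], time[w], node_label[u], edge_label[e], node_label[w]))

    def visit(u):
        # Backward edges first: from u to already-discovered nodes, earliest first.
        back = [w for w in adj[u] if w in time and frozenset((u, w)) not in used]
        for w in sorted(back, key=time.get):
            emit(u, w)
        # Forward edges: discover each unvisited neighbour and recurse into it.
        for w in adj[u]:
            if w not in time:
                time[w] = len(time)
                emit(u, w)
                visit(w)

    visit(start)
    return code


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["b", "a", "d"], "d": ["c"]}
node_label = {"a": "C", "b": "C", "c": "O", "d": "N"}
edge_label = {
    frozenset(("a", "b")): 1,
    frozenset(("b", "c")): 1,
    frozenset(("a", "c")): 2,
    frozenset(("c", "d")): 1,
}

code = dfs_code(adj, node_label, edge_label, "a")
print(code)
```

Starting from node `a`, the traversal yields three forward edges and one backward edge, `(2, 0, 'O', 2, 'C')`, which closes the triangle. Because every edge appears exactly once together with both endpoint labels, the sequence carries the structure and the label information, which is the property the abstract relies on.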
How many structurally different microscopic routes are accessible to a protein molecule while folding? This has been a challenging question to address experimentally as single-molecule studies are constrained by the limited number of observed folding events while ensemble measurements, by definition, report only an average and not the distribution of the quantity under study. Atomistic simulations, on the other hand, are restricted by sampling and the inability to reproduce thermodynamic observables directly. We overcome these bottlenecks in the current work and provide a quantitative description of folding pathway heterogeneity by developing a comprehensive, scalable and yet experimentally consistent approach combining concepts from statistical mechanics, physical kinetics and graph theory. We quantify the folding pathway heterogeneity of five single-domain proteins under two thermodynamic conditions from an analysis of 100 000 folding events generated from a statistical mechanical model incorporating the detailed energetics from more than a million conformational states. The resulting microstate energetics predicts the results of protein engineering experiments, the thermodynamic stabilities of secondary-structure segments from NMR studies, and the end-to-end distance estimates from single-molecule force spectroscopy measurements. We find that a minimum of ∼3-200 microscopic routes, with a diverse ensemble of transition-path structures, are required to account for the total folding flux across the five proteins and the thermodynamic conditions. The partitioning of flux amongst the numerous pathways is shown to be subtly dependent on the experimental conditions that modulate protein stability, topological complexity and the structural resolution at which the folding events are observed. 
Our predictive methodology thus reveals the presence of rich ensembles of folding mechanisms that are generally invisible in experiments, reconciles the contradictory observations from experiments and simulations and provides an experimentally consistent avenue to quantify folding heterogeneity.