To meet the challenge of processing rapidly growing graph and network data created by modern applications, a number of distributed graph processing systems have emerged, such as Pregel and GraphLab. All these systems divide input graphs into partitions, and employ a "think like a vertex" programming model to support iterative graph computation. This vertex-centric model is easy to program and has proven useful for many graph algorithms. However, this model hides the partitioning information from the users, thus preventing many algorithm-specific optimizations. This often results in longer execution time due to excessive network messages (e.g., in Pregel) or heavy scheduling overhead to ensure data consistency (e.g., in GraphLab). To address this limitation, we propose a new "think like a graph" programming paradigm. Under this graph-centric model, the partition structure is opened up to the users, and can be utilized so that communication within a partition can bypass the heavy message passing or scheduling machinery. We implemented this model in a new system, called Giraph++, based on Apache Giraph, an open source implementation of Pregel. We explore the applicability of the graph-centric model to three categories of graph algorithms, and demonstrate its flexibility and superior performance, especially on well-partitioned data. For example, on a web graph with 118 million vertices and 855 million edges, the graph-centric version of the connected component detection algorithm runs 63X faster and uses 204X fewer network messages than its vertex-centric counterpart.
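To make the contrast concrete, here is a minimal sketch of the vertex-centric ("think like a vertex") style the abstract refers to, using the classic label-propagation (HashMin) algorithm for connected components. The superstep loop and message-passing structure below are illustrative only, not Giraph's actual API; in a graph-centric system, label propagation within a partition would proceed without these per-edge messages.

```python
def hash_min(adjacency):
    """Vertex-centric connected components (HashMin sketch).

    adjacency: {vertex_id: [neighbor_ids]} for an undirected graph.
    Returns {vertex_id: component_label}, where each component is
    labeled by its smallest vertex id.
    """
    labels = {v: v for v in adjacency}   # every vertex starts in its own component
    changed = set(adjacency)             # vertices whose label changed last superstep

    while changed:
        # One superstep: each changed vertex sends its label to all neighbors.
        inbox = {v: [] for v in adjacency}
        for v in changed:
            for u in adjacency[v]:
                inbox[u].append(labels[v])

        # Each vertex adopts the smallest label it has seen.
        changed = set()
        for v, msgs in inbox.items():
            if msgs and min(msgs) < labels[v]:
                labels[v] = min(msgs)
                changed.add(v)
    return labels
```

Every label change costs one message per incident edge here, which is exactly the overhead the graph-centric model avoids for intra-partition edges.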
We investigate the incremental validation of XML documents with respect to DTDs, specialized DTDs, and XML Schemas, under updates consisting of element tag renamings, insertions, and deletions. DTDs are modeled as extended context-free grammars. "Specialized DTDs" allow the decoupling of element types from element tags. XML Schemas are abstracted as specialized DTDs with limitations on the type assignment. For DTDs and XML Schemas, we exhibit an O(m log n) incremental validation algorithm using an auxiliary structure of size O(n), where n is the size of the document and m the number of updates. This algorithm does not handle the incremental validation of XML Schemas with respect to renamings of internal nodes; that case is covered by the incremental validation algorithm for specialized DTDs. For specialized DTDs, we provide an O(m log² n) incremental algorithm, again using an auxiliary structure of size O(n). This is a significant improvement over brute-force re-validation from scratch. We exhibit a restricted class of DTDs, called local, that arise commonly in practice and for which incremental validation can be done in practically constant time by maintaining only a list of counters. We present implementations of both general incremental validation and local validation on an XML database built on top of a relational database. Our experimentation includes a study of the applicability of local validation in practice, results on the calibration of parameters of the auxiliary data structure, and results on the performance comparison between the general incremental validation technique, the local validation technique, and brute-force validation from scratch.
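The counter-based idea behind local validation can be sketched as follows. Assume (as a simplification of the paper's formal definition of "local") that a node's content model constrains only how many children of each tag it has, e.g. "exactly one title, any number of items". Then per-tag counters suffice, and an insertion or deletion revalidates the affected node in O(1); the min/max encoding below is our own illustration, not the paper's construction.

```python
class CounterValidator:
    """Counter-based validity tracking for one node's children.

    bounds: {tag: (min_count, max_count)}; max_count=None means unbounded.
    Hypothetical encoding of a "local" content model as cardinality bounds.
    """

    def __init__(self, bounds):
        self.bounds = bounds
        self.counts = {tag: 0 for tag in bounds}

    def _tag_ok(self, tag):
        lo, hi = self.bounds[tag]
        n = self.counts[tag]
        return n >= lo and (hi is None or n <= hi)

    def valid(self):
        # Full check over all constrained tags (used rarely).
        return all(self._tag_ok(t) for t in self.bounds)

    def insert(self, tag):
        # O(1): only this tag's counter can change validity.
        self.counts[tag] += 1
        return self._tag_ok(tag)

    def delete(self, tag):
        self.counts[tag] -= 1
        return self._tag_ok(tag)
```

For example, under bounds `{'title': (1, 1), 'item': (0, None)}` a node is invalid until one title child is inserted, and becomes invalid again if a second title is added.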
XML languages, such as XQuery, XSLT and SQL/XML, employ XPath as the search and extraction language. XPath expressions often define complicated navigation, resulting in expensive query processing, especially when executed over large collections of documents. In this paper, we propose a framework for exploiting materialized XPath views to expedite processing of XML queries. We explore a class of materialized XPath views, which may contain XML fragments, typed data values, full paths, node references, or any combination thereof. We develop an XPath matching algorithm to determine when such views can be used to answer a user query containing XPath expressions. We use the match information to identify the portion of an XPath expression in the user query that is not covered by the XPath view. Finally, we construct one or more compensation expressions, which are applied to the view to produce the query result. Experimental evaluation, using our prototype implementation, shows that the matching algorithm is very efficient and usually accounts for a small fraction of the total query compilation time.
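A toy illustration of the compensation idea: if a materialized view stores the nodes reached by a simple child-axis path that is a prefix of the query's path, the query can be answered by running only the leftover steps over the view's nodes. The real matching algorithm handles predicates, descendant axes, value extraction, and so on; this prefix check is only a sketch under that simplifying assumption.

```python
def compensation(query_path, view_path):
    """Return the compensation steps for answering query_path from view_path.

    Both paths are lists of element tags, e.g. ['a', 'b', 'c'] for /a/b/c.
    Returns the leftover steps to apply to the view's nodes, or None if
    the view cannot be used (its path is not a prefix of the query's).
    """
    if len(view_path) > len(query_path):
        return None                      # view navigates deeper than the query
    if query_path[:len(view_path)] != view_path:
        return None                      # paths diverge: no match
    return query_path[len(view_path):]   # [] means the view answers the query exactly
```

For instance, a view on /a/b can answer /a/b/c/d with compensation ./c/d, while a view on /a/x cannot be used at all.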
apapadim@anderson.ucla.edu Analysts and decision-makers use what-if analysis to assess the effects of hypothetical scenarios on historical data. For example, an analyst working for a financial company may construct a scenario for a strategy that would involve higher positions in stocks of large high-tech companies for the last several years. The hypothetical world created by this scenario will be queried to calculate the returns and volatility of customer portfolios under this scenario. Current On-Line Analytical Processing (OLAP) systems support what-if analysis only by physically replicating the data warehouse and modifying it according to the scenario. This process may take many hours, hence limiting the applicability of OLAP. To eliminate this inefficiency, we built an OLAP toolkit called Sesame that exploits the following two opportunities. First, typically only a small part of the modified data is needed to answer the hypothetical query. For example, the analyst might be interested in results of the new strategy only during certain periods of time, so many of the modifications were not needed. Second, data warehouses tend to have precomputed materialized views to help answer popular queries. For example, the financial data warehouse most likely will have materialized sums of transaction history by stock type. This view will include the total positions in large high-tech companies, which we can leverage when computing the hypothetical query above. To exploit the first opportunity, Sesame introduced lazy evaluation of hypothetical queries. No modifications are done before the user issues a query. Instead, a scenario is modeled as an ordered set of hypothetical view definitions. The Sesame optimizer, based on term rewriting, converts the hypothetical query into one referring only to actual data. The Sesame rewriter leverages materialized views to minimize the execution time. Also, Sesame is the first OLAP system that allows interfacing of arbitrary operators written in Java, which can be used in user queries as well as in view definitions.
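Sesame's lazy-evaluation idea can be sketched as follows: a scenario is an ordered list of hypothetical view definitions, no data is physically copied, and a query is answered over the base data with the scenario's modifications applied on the fly only to the rows the query actually touches. The delta representation (predicate plus row transform) and the function names here are our own illustration, not Sesame's actual interface, which rewrites queries by term rewriting and exploits materialized views.

```python
def hypothetical_sum(rows, scenario, query_pred, measure):
    """Answer a hypothetical aggregate query lazily, without copying the warehouse.

    rows:       the base fact table, a list of dicts.
    scenario:   ordered list of (pred, transform) pairs; each hypothetical
                "view definition" rewrites the rows its predicate selects.
    query_pred: which (hypothetical) rows the query aggregates over.
    measure:    the column to sum.
    """
    total = 0.0
    for row in rows:
        r = dict(row)                        # never mutate the base data
        for pred, transform in scenario:     # apply hypothetical views in order
            if pred(r):
                r = transform(r)
        if query_pred(r):
            total += r[measure]
    return total
```

For example, a scenario that doubles positions in high-tech stocks changes the answer of "total high-tech position" without touching rows in other sectors.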
Abstract: A large-scale network of social interactions, such as mentions in Twitter, can often be modeled as a "dynamic interaction graph" in which new interactions (edges) are continually added over time. Existing systems for extracting timely insights from such graphs are based on either a cumulative "snapshot" model or a "sliding window" model. The former model does not sufficiently emphasize recent interactions. The latter model abruptly forgets past interactions, leading to discontinuities in which, e.g., the graph analysis completely ignores historically important influencers who have temporarily gone dormant. We introduce TIDE, a distributed system for analyzing dynamic graphs that employs a new "probabilistic edge decay" (PED) model. In this model, the graph analysis algorithm of interest is applied at each time step to one or more graphs obtained as samples from the current "snapshot" graph that comprises all interactions that have occurred so far. The probability that a given edge of the snapshot graph is included in a sample decays over time according to a user-specified decay function. The PED model allows controlled trade-offs between recency and continuity, and allows existing analysis algorithms for static graphs to be applied to dynamic graphs essentially without change. For the important class of exponential decay functions, we provide efficient methods that leverage past samples to incrementally generate new samples as time advances. We also exploit the large degree of overlap between samples to reduce memory consumption from O(N) to O(log N) when maintaining N sample graphs. Finally, we provide bulk-execution methods for applying graph algorithms to multiple sample graphs simultaneously without requiring any changes to existing graph-processing APIs. Experiments on a real Twitter dataset demonstrate the effectiveness and efficiency of our TIDE prototype, which is built on top of the Spark distributed computing framework.
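The incremental-sampling trick for exponential decay rests on a simple identity: if an edge of age a is kept with probability f**a for some decay factor f in (0, 1], then f**a = f * f**(a-1), so a valid sample for time t can be produced from the time t-1 sample by letting each retained edge survive independently with probability f and admitting new edges with probability f**0 = 1. The sketch below illustrates that step; the function and parameter names are ours, not TIDE's API.

```python
import random


def advance(sample, new_edges, f, rng):
    """One time step of incremental sampling under exponential edge decay.

    sample:    set of edges currently in the sample graph.
    new_edges: edges that arrived during this time step (always admitted,
               since a fresh edge has inclusion probability f**0 == 1).
    f:         per-step decay factor in (0, 1].
    rng:       a random.Random instance, injectable for reproducibility.
    """
    survivors = {e for e in sample if rng.random() < f}  # each edge survives w.p. f
    return survivors | set(new_edges)
```

Calling `advance` once per time step maintains, at every step, a sample in which an edge of age a is present with probability exactly f**a, without ever rescanning the full snapshot graph.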