Efficient storage and querying of RDF data is of increasing importance, due to the increased popularity and widespread acceptance of RDF on the web and in the enterprise. In this paper, we describe a novel storage and query mechanism for RDF which works on top of existing relational representations. Reliance on relational representations of RDF means that one can take advantage of 35+ years of research on efficient storage and querying, industrial-strength transaction support, locking, security, etc. However, there are significant challenges in storing RDF in relational, which include data sparsity and schema variability. We describe novel mechanisms to shred RDF into relational, and novel query translation techniques to maximize the advantages of this shredded representation. We show that these mechanisms result in consistently good performance across multiple RDF benchmarks, even when compared with current state-of-the-art stores. This work provides the basis for RDF support in DB2 v.10.1.
This paper presents a method that allows a replicated database system to provide a global isolation level stronger than the isolation level provided on each individual database replica. We propose a new multi-version concurrency control algorithm called, serializable generalized snapshot isolation (SGSI), that targets middleware replicated database systems. Each replica runs snapshot isolation locally and the replication middleware guarantees global one-copy serializability. We introduce novel techniques to provide a stronger global isolation level, namely readset extraction and enhanced certification that prevents readwrite and write-write conflicts in a replicated setting. We prove the correctness of the proposed algorithm, and build a prototype replicated database system to evaluate SGSI performance experimentally. Extensive experiments with an 8 replica database system under the TPC-W workload mixes demonstrate the practicality and low overhead of the algorithm. I. INTRODUCTIONIn many server systems replication is used to achieve higher performance and availability than a centralized server. Replication in database systems is, however, particularly challenging because the transactional semantics have to be maintained. The effects of an update transaction at one replica have to be efficiently propagated and synchronized at all other replicas, while maintaining consistency for all update and read-only transactions. This challenge has long been recognized [18], leading to several replication protocols that explicitly tradeoff consistency to achieve higher performance: A replicated database system may provide a lower isolation level than a centralized database system.We show, contrary to common belief, that a replicated database system can efficiently provide a global isolation level stronger than the isolation level provided by the constituent replicas. We focus here on snapshot isolated database systems and introduce a concurrency control algorithm that guarantees global one-copy serializability (1SR), while each replica guarantees snapshot isolation (SI), which is weaker than serializability. We support this claim by proposing an algorithm, proving its correctness, implementing it, and building a prototype replicated database to evaluate it experimentally.Database engines such as PostgreSQL and Oracle support SI as it provides attractive performance for an important class of transactional workloads that have certain properties, e.g., dominance of read-only transactions, short updates and absence of write hot-spots. To take advantage of this performance gain, some database engines support SI in addition to traditional locking schemes. For example, Microsoft SQL Server supports both SI and 2PL. The performance gain of using SI comes at a correctness cost: SI is not serializable.
Supervised learning models often perform poorly at low-shot tasks, i.e. tasks for which little labeled data is available for training. One prominent approach for improving low-shot learning is to use unsupervised pre-trained neural models. Another approach is to obtain richer supervision by collecting annotator rationales (explanations supporting label annotations). In this work, we combine these two approaches to improve lowshot text classification with two novel methods: a simple bag-of-words embedding approach; and a more complex context-aware method, based on the BERT model. In experiments with two English text classification datasets, we demonstrate substantial performance gains from combining pre-training with rationales. Furthermore, our investigation of a range of train-set sizes reveals that the simple bag-of-words approach is the clear top performer when there are only a few dozen training instances or less, while more complex models, such as BERT or CNN, require more training data to shine.
Adaptive join algorithms have recently attracted a lot of attention in emerging applications where data are provided by autonomous data sources through heterogeneous network environments. Their main advantage over traditional join techniques is that they can start producing join results as soon as the first input tuples are available, thus, improving pipelining by smoothing join result production and by masking source or network delays. In this paper, we first propose Double Index NEsted-loops Reactive join (DINER), a new adaptive two-way join algorithm for result rate maximization. DINER combines two key elements: an intuitive flushing policy that aims to increase the productivity of in-memory tuples in producing results during the online phase of the join, and a novel reentrant join technique that allows the algorithm to rapidly switch between processing in-memory and disk-resident tuples, thus, better exploiting temporary delays when new data are not available. We then extend the applicability of the proposed technique for a more challenging setup: handling more than two inputs. Multiple Index NEsted-loop Reactive join (MINER) is a multiway join operator that inherits its principles from DINER. Our experiments using real and synthetic data sets demonstrate that DINER outperforms previous adaptive join algorithms in producing result tuples at a significantly higher rate, while making better use of the available memory. Our experiments also shows that in the presence of multiple inputs, MINER manages to produce a high percentage of early results, outperforming existing techniques for adaptive multiway join.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.