Single-version locking scheduler

Proving the single-version locking scheme correct is trivial, as the scheduler is a 2PL scheduler.

Multi-version pessimistic (locking) scheduler

The multi-version pessimistic (locking) scheme is in fact a MV2PL scheduler. Holding a certify (commit) lock on a data item in MV2PL is exactly like having the NoMoreReadLocks bit set in the latest version of the data item in our implementation (see Section 4.2.1). Section 5.5.2 of [WV02] describes MV2PL in detail and proves it only admits 1SR multi-version histories.

Multi-version optimistic scheduler

Let us now prove that the multi-version optimistic scheduler only admits 1SR multi-version histories. We use the notation and theorems from Section 5.2 of [BHG87]. The multi-version optimistic scheduler behaves like a MVTO scheduler, with the changes described below.

Let transaction Tx be a committed transaction with a Begin timestamp of TxBegin and an End timestamp of TxEnd.

Property 1: Timestamps are assigned in a monotonically increasing order, and each transaction has a unique begin and end timestamp, such that TxBegin < TxEnd.

Property 2: A given version is valid for the interval specified by the begin and end timestamps. There is a total order << of versions for a given datum, as determined by the timestamp order of the nonoverlapping version validity intervals.

Property 3: The transaction Tx reads the latest committed version as of TxRead (where TxBegin <= TxRead < TxEnd) and validates (that is, repeats) the read of the latest committed version as of TxEnd. The transaction fails if the two reads return different versions.

Property 4: Updates or deletes to a version V first check the visibility of V. Checking the visibility of V is equivalent to reading V. Therefore, a write is always preceded by a read: if transaction Tx writes Vnew, then transaction Tx has first read Vold, where Vold << Vnew. Moreover, there exists no version V such that Vold << V << Vnew, otherwise Tx would have never committed: it would have failed during the Active phase when changing the end timestamp of Vold (see Section 3.1, paragraph "Update version").¹

¹ Notice that all our concurrency control algorithms enforce a stronger property: they use the first-writer-wins rule to abort transactions that participate in a write-write conflict before it is determined whether the first writer will commit. The more relaxed property described here is sufficient to prove correctness.
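Properties 2 and 3 can be illustrated with a small Python sketch. It is not the implementation from the paper: the names Version, visible_version and validate_read are hypothetical, and the sketch assumes that a version's validity interval is half-open, [begin, end), with an infinite end timestamp for the latest version.

```python
from dataclasses import dataclass
from typing import List, Optional

INF = float("inf")  # assumed end timestamp of the latest version


@dataclass
class Version:
    value: str
    begin: float        # timestamp at which the version becomes valid
    end: float = INF    # timestamp at which the version stops being valid


def visible_version(versions: List[Version], ts: float) -> Optional[Version]:
    """Return the version whose validity interval [begin, end) contains ts.

    Property 2 says the intervals do not overlap, so at most one matches.
    """
    for v in versions:
        if v.begin <= ts < v.end:
            return v
    return None


def validate_read(versions: List[Version], read_ts: float, end_ts: float) -> bool:
    """Property 3: repeat the read as of end_ts and compare with the read at read_ts.

    The transaction may commit only if both reads return the same version.
    """
    return visible_version(versions, read_ts) is visible_version(versions, end_ts)


# Hypothetical history of one datum: "a" valid in [10, 20), "b" valid from 20 on.
history = [Version("a", 10, 20), Version("b", 20)]

ok = validate_read(history, read_ts=12, end_ts=15)      # both reads see "a"
failed = validate_read(history, read_ts=12, end_ts=25)  # second read sees "b"
```

In this sketch a transaction that read at timestamp 12 and ends at 15 passes validation, while one that ends at 25 does not, because another transaction installed a newer version in between.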
Hekaton is a new database engine optimized for memory resident data and OLTP workloads. Hekaton is fully integrated into SQL Server; it is not a separate system. To take advantage of Hekaton, a user simply declares a table memory optimized. Hekaton tables are fully transactional and durable and accessed using T-SQL in the same way as regular SQL Server tables. A query can reference both Hekaton tables and regular tables and a transaction can update data in both types of tables. T-SQL stored procedures that reference only Hekaton tables can be compiled into machine code for further performance improvements. The engine is designed for high concurrency. To achieve this it uses only latch-free data structures and a new optimistic, multiversion concurrency control technique. This paper gives an overview of the design of the Hekaton engine and reports some experimental results.
Recent OLTP support exploits new techniques, running on modern hardware, to achieve unprecedented performance compared with prior approaches. In SQL Server, the Hekaton main-memory database engine embodies this new OLTP support. Hekaton uses the Bw-tree to achieve its great indexing performance. The Bw-tree is a latch-free B-tree index that also exploits log-structured storage when used "beyond" Hekaton as a separate key value store. It is designed from the ground up to address two hardware trends: (1) Multi-core and main memory hierarchy: the Bw-tree is completely latch-free, using an atomic compare-and-swap instruction to install state changes on a "page address" mapping table; it performs updates as "deltas" to avoid update-in-place. These improve performance by eliminating thread blocking while improving cache hit ratios. (2) Flash storage: the Bw-tree organizes secondary storage in a log-structured manner, using large sequential writes to avoid entirely the adverse performance impact of random writes. We demonstrate the architectural versatility and performance of the Bw-tree in two scenarios: (a) running live within Hekaton and (b) running as a standalone key value store compared to both BerkeleyDB and a state-of-the-art in-memory range index (latch-free skiplists). Using workloads from real-world applications (Microsoft XBox Live Primetime and enterprise deduplication), we show the Bw-tree is 19x faster than BerkeleyDB and 3x faster than skiplists.
Azure SQL Database and the upcoming release of SQL Server introduce a novel database recovery mechanism that combines traditional ARIES recovery with multi-version concurrency control to achieve database recovery in constant time, regardless of the size of user transactions. Additionally, our algorithm enables continuous transaction log truncation, even in the presence of long running transactions, thereby allowing large data modifications using only a small, constant amount of log space. These capabilities are particularly important for any Cloud database service given a) the constantly increasing database sizes, b) the frequent failures of commodity hardware, c) the strict availability requirements of modern, global applications and d) the fact that software upgrades and other maintenance tasks are managed by the Cloud platform, introducing unexpected failures for the users. This paper describes the design of our recovery algorithm and demonstrates how it allowed us to improve the availability of Azure SQL Database by guaranteeing consistent recovery times of under 3 minutes for 99.999% of recovery cases in production.