2021
DOI: 10.14778/3457390.3457399
Tensor relational algebra for distributed machine learning system design

Abstract: We consider the question: what is the abstraction that should be implemented by the computational engine of a machine learning system? Current machine learning systems typically push whole tensors through a series of compute kernels such as matrix multiplications or activation functions, where each kernel runs on an AI accelerator (ASIC) such as a GPU. This implementation abstraction provides little built-in support for ML systems to scale past a single machine, or for handling large models with matrices or te…
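The "whole tensors through a series of compute kernels" abstraction the abstract critiques can be illustrated with a minimal sketch. This is an illustrative toy using NumPy, not code from the paper; the layer shapes and names are assumptions.

```python
import numpy as np

# Each step below is one kernel invocation over the *entire* tensor,
# mirroring how a GPU runtime executes matmul and activation kernels.
def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 64))   # full input batch as one tensor
W1 = rng.standard_normal((64, 32))   # layer-1 weights
W2 = rng.standard_normal((32, 8))    # layer-2 weights

H = relu(X @ W1)   # kernel 1: matmul; kernel 2: activation
Y = H @ W2         # kernel 3: matmul
# Y has shape (128, 8); nothing in this pipeline says how to split X or W1
# across machines, which is the scaling gap the paper targets.
```

Because every kernel consumes and produces a whole tensor, distribution and out-of-memory handling must be bolted on outside the abstraction, which is the motivation for a relational treatment of tensors.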

Cited by 18 publications (7 citation statements) | References 34 publications
“…Our work complements prior work on linear algebra computation powered by database engines [7,14,29,35,45] and on languages that unify linear algebra and relational algebra [13,17,25]. No prior work considered the interaction of QR decomposition with database joins.…”
Section: Introduction
confidence: 74%
“…There has been a plethora of work in the past decade focusing on in-DB ML [80,44,37,70,53,63,32,67,50,56,45,57,78,48]. Most existing in-DB ML systems implement SGD as "User-Defined Aggregates" (UDA) [37,44].…”
Section: In-database Machine Learning Systems
confidence: 99%
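The citation statement above notes that most in-DB ML systems implement SGD as "User-Defined Aggregates" (UDA). A UDA decomposes an aggregate into per-tuple accumulation, a merge of partial states, and a finalizer, which is what lets the database parallelize it. The sketch below is a hedged illustration of that pattern in Python for a least-squares gradient step; the class name and the init/accumulate/merge/finalize interface are illustrative, not taken from any particular system.

```python
import numpy as np

class GradientUDA:
    """Toy UDA computing one gradient-descent step over relation rows (x, y)."""

    def __init__(self, w):
        # aggregate state: current weights, running gradient, row count
        self.w = np.asarray(w, dtype=float)
        self.grad = np.zeros_like(self.w)
        self.n = 0

    def accumulate(self, x, y):
        # per-tuple transition: add this row's squared-loss gradient
        x = np.asarray(x, dtype=float)
        self.grad += (x @ self.w - y) * x
        self.n += 1

    def merge(self, other):
        # combine partial states computed on different partitions/workers
        self.grad += other.grad
        self.n += other.n
        return self

    def finalize(self, lr=0.1):
        # one step on the mean gradient
        return self.w - lr * self.grad / max(self.n, 1)
```

Because `merge` is associative, the engine can scan partitions in parallel and combine their states, which is why the UDA interface is the common integration point for ML in an RDBMS.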
“…In-DB ML Previous work [80,44,37,70,53,63,32,67,50,56,45,57,78,48,58,15] has intensively discussed how to implement ML models on relational data, such as linear models [70,53,63], linear algebra [32,56,57], factorization models [67], neural networks [45,57,78] and other statistical learning models [50], using Batch Gradient Descent (BGD) or SGD, over join or self-defined matrix/tensors, etc. The most common way of integrating ML algorithm into RDBMS is to use User-Defined Aggregate Functions (UDA).…”
Section: Related Work
confidence: 99%
“…Second, the current version of BAGUA only focuses on data parallelism and it is interesting future work to integrate other techniques such as model parallelism (e.g. [40,41,42,43,44,45,46,47]) and pipeline parallelism (e.g., [48,49,50,51]) and to understand the system abstractions.…”
Section: Limitations and Moving Forward
confidence: 99%
“…BAGUA is built on decades of research regarding distributed machine learning systems and algorithms. Plenty of them are from the database community [52,53,54,55,56,57,58,59,60,61,46,47]. We now summarize related work and discuss some in details to provide backgrounds and contexts.…”
Section: Preliminaries and Related Work
confidence: 99%