Abstract: Motivated by fundamental applications in databases and relational machine learning, we formulate and study the problem of answering functional aggregate queries (FAQ) in which some of the input factors are defined by a collection of additive inequalities between variables. We refer to these queries as FAQ-AI for short. To answer FAQ-AI in the Boolean semiring, we define relaxed tree decompositions and relaxed …
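As a rough illustration of what an aggregate conditioned on an additive inequality looks like, the sketch below counts the pairs (a, b) drawn from two value lists that satisfy a + b <= c. It is a toy instance in the counting setting, not the relaxed tree decomposition technique the abstract refers to; the function names, the integer inputs, and the sort-plus-binary-search strategy are all assumptions made for the example.

```python
from bisect import bisect_right


def count_pairs_naive(r_values, s_values, c):
    """Count pairs (a, b) with a + b <= c by checking every pair."""
    return sum(1 for a in r_values for b in s_values if a + b <= c)


def count_pairs_sorted(r_values, s_values, c):
    """Return the same count using a sorted copy of s_values and binary search."""
    s_sorted = sorted(s_values)
    # For a fixed a, the qualifying b are exactly those with b <= c - a,
    # so bisect_right gives how many of them there are.
    return sum(bisect_right(s_sorted, c - a) for a in r_values)


# Hypothetical integer inputs, chosen only for illustration.
r_values = [1, 4, 7]
s_values = [2, 3, 9]
threshold = 8

assert count_pairs_naive(r_values, s_values, threshold) == count_pairs_sorted(
    r_values, s_values, threshold
)
```

The naive version inspects every pair, as a generic theta join would, while the sorted version needs one sort and one binary search per element of `r_values`.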
“…• > , where and are constants and are the features. The efficient computation of aggregates conditioned on additive inequalities calls for new algorithms beyond the classical ones for theta joins [2,20]. Similar aggregates are derived for k-means clustering [2].…”
Section: Turn the ML Problem Into A DB Problem (mentioning)
This tutorial overviews principles behind recent works on training and maintaining machine learning models over relational data, with an emphasis on the exploitation of the relational data structure to improve the runtime performance of the learning task. The tutorial has the following parts: (1) Database research for data science; (2) Three main ideas to achieve performance improvements: (2.1) Turn the ML problem into a DB problem, (2.2) Exploit structure of the data and problem, (2.3) Exploit engineering tools of a DB researcher; (3) Avenues for future research.
“…Chapter 7 presents the details on the rewritings for a range of non-polynomial loss functions. We first presented these contributions in [5].…”
Section: Rewriting Of Data-intensive Computation Into Aggregate Queries (mentioning)
confidence: 99%
“…For the k-means clustering, we show how the data-intensive computation of the algorithm can be reformulated into aggregate queries with additive inequalities. We first presented this reformulation in [5].…”
Section: Rewriting Of Data-intensive Computation Into Aggregate Queries (mentioning)
confidence: 99%
“…Using this approach, the overall runtime for the evaluation of aggregate queries with additive inequalities can be faster by a polynomial factor than existing approaches. We first presented this insight in [5].…”
Section: Runtime Bounds For End-to-end Pipeline (mentioning)
confidence: 99%
“…The publication also presents a novel algorithm for the evaluation of aggregate queries, called #PANDA [5]. In contrast to the factorized evaluation algorithms considered in this thesis, #PANDA decomposes the query into several subproblems and then solves each subproblem with the respective optimal tree decomposition.…”
Section: Runtime Bounds For End-to-end Pipeline (mentioning)
First and foremost, I would like to express my sincere gratitude to my supervisor Prof. Dan Olteanu for his guidance, invaluable advice, and the many hours we spent discussing the project. His support was a crucial factor in the success of this thesis, and I could not have asked for a better supervisor. A big, heartfelt thank you goes out to my colleagues at relationalAI, in particular Hung Ngo, Mahmoud Abo Khamis, and Long Nguyen, for their contributions to much of the work presented in this thesis. The thesis would not have been possible without their support. I also thank Molham Aref for providing me with the opportunity to spend one summer at relationalAI in Berkeley. I appreciate all the support and companionship of my colleagues from the Oxford FDB group, both in the lab and outside. Their presence has made the past years significantly more enjoyable. On a more personal note, I owe eternal gratitude to my family, in particular my parents and my brother, for their continuous support and encouragement. I would not have been able to pursue this DPhil without their support. Lastly and most importantly, thank you Julia, for your support, love, and patience. I look forward to a lifetime of adventures with you.
This tutorial overviews the state of the art in learning models over relational databases and makes the case for a first-principles approach that exploits recent developments in database research. The input to learning classification and regression models is a training dataset defined by feature extraction queries over relational databases. The mainstream approach to learning over relational data is to materialize the training dataset, export it out of the database, and then learn over it using a statistical package. This approach can be expensive as it requires the materialization of the training dataset. An alternative approach is to cast the machine learning problem as a database problem by transforming the data-intensive component of the learning task into a batch of aggregates over the feature extraction query and by computing this batch directly over the input database. The tutorial highlights a variety of techniques developed by the database theory and systems communities to improve the performance of the learning task. They rely on structural properties of the relational data and of the feature extraction query, including algebraic (semi-ring), combinatorial (hypertree width), statistical (sampling), or geometric (distance) structure. They also rely on factorized computation, code specialization, query compilation, and parallelization.
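To make the idea of computing an aggregate directly over the input database more concrete, the sketch below evaluates SUM(x * y) over the join of two relations in two ways: by materializing the join first, and by aggregating each relation per join key and combining the partial sums. It is a minimal illustration under assumed inputs, not the tutorial's own system; the relation layout (key, value), the function names, and the sample rows are hypothetical.

```python
from collections import defaultdict


def sum_xy_materialized(r_rows, s_rows):
    """Materialize the join on the key, then aggregate SUM(x * y)."""
    joined = [(x, y) for (kr, x) in r_rows for (ks, y) in s_rows if kr == ks]
    return sum(x * y for x, y in joined)


def sum_xy_pushed_down(r_rows, s_rows):
    """Aggregate each relation per key first, then combine the partial sums."""
    sum_x = defaultdict(int)
    for k, x in r_rows:
        sum_x[k] += x
    sum_y = defaultdict(int)
    for k, y in s_rows:
        sum_y[k] += y
    # Within one key, the join is a Cartesian product, so
    # SUM(x * y) over that key equals (sum of x) * (sum of y).
    return sum(sum_x[k] * sum_y[k] for k in sum_x if k in sum_y)


# Hypothetical relations R(key, x) and S(key, y).
r_rows = [(1, 2), (1, 3), (2, 5)]
s_rows = [(1, 10), (2, 1), (2, 4), (3, 7)]

assert sum_xy_materialized(r_rows, s_rows) == sum_xy_pushed_down(r_rows, s_rows)
```

The second function never builds the joined rows, which is the point of the approach: the aggregate is computed with one pass over each input relation instead of over the potentially much larger join result.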