2021
DOI: 10.48550/arxiv.2106.07258
Preprint
GitTables: A Large-Scale Corpus of Relational Tables

Abstract: The practical success of deep learning has sparked interest in improving relational table tasks, like data search, with models trained on large table corpora. Existing corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need additional resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of currently 1.7M relational …

Cited by 3 publications (5 citation statements)
References 19 publications (39 reference statements)
“…Pylon Benchmark. We create a new dataset from GitTables [25], a data lake of 1.7M tables extracted from CSV files on GitHub. The benchmark comprises 1,746 tables including unionable table subsets under topics selected from Schema.org [26]: scholarly article, job posting, and music playlist.…”
Section: Methods
confidence: 99%
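The Pylon statement above describes selecting topic-specific tables from GitTables to assemble a unionability benchmark. The sketch below is a minimal illustration of that kind of selection, assuming the GitTables CSV files have already been downloaded and unpacked into per-topic directories; the directory layout, folder names, and the load_topic_tables helper are hypothetical and are not part of the GitTables or Pylon code.

```python
import glob
import os
import pandas as pd

# Assumption: a local copy of GitTables CSV files, grouped into one folder per topic.
GITTABLES_DIR = "gittables/"
TOPICS = ["scholarly_article", "job_posting", "music_playlist"]  # topics named in the citation


def load_topic_tables(topic, limit=None):
    """Load the CSV tables filed under one Schema.org topic directory."""
    paths = sorted(glob.glob(os.path.join(GITTABLES_DIR, topic, "*.csv")))
    if limit is not None:
        paths = paths[:limit]
    tables = []
    for path in paths:
        try:
            tables.append(pd.read_csv(path, on_bad_lines="skip"))
        except (pd.errors.ParserError, UnicodeDecodeError):
            continue  # skip tables that fail to parse cleanly
    return tables


# Collect tables per topic and report how many were loaded for each.
benchmark = {topic: load_topic_tables(topic) for topic in TOPICS}
print({topic: len(tables) for topic, tables in benchmark.items()})
```

This only groups raw tables by topic; the actual Pylon benchmark additionally curates unionable subsets within each topic, which is not reproduced here.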
“…We believe that this gap can be attributed to the datasets used to pretrain these models, which mainly represent tables from the Web. Such tables can only partially represent tables found in enterprise databases [18,25,43]. This affects the applicability of concurrent pretrained table models to downstream tasks on typical "offline" databases.…”
Section: Unrepresentative Training Data
confidence: 99%
“…Unlike large corpora of text extracted from the Web which are shown to be instrumental for pretraining widely used language models [3,9], pretrained table models have shown less impact in this regard. In fact, the generalizability of models trained towards typical database tables is found to be limited [18,25].…”
Section: Relevant Training Data
confidence: 99%