2021
DOI: 10.48550/arxiv.2106.07258
Preprint
GitTables: A Large-Scale Corpus of Relational Tables

Abstract: The practical success of deep learning has sparked interest in improving relational table tasks, like data search, with models trained on large table corpora. Existing corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need additional resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of currently 1.7M relational …

Cited by 3 publications (5 citation statements)
References 19 publications (39 reference statements)
“…Pylon Benchmark. We create a new dataset from GitTables [25], a data lake of 1.7M tables extracted from CSV files on GitHub. The benchmark comprises 1,746 tables including unionable table subsets under topics selected from Schema.org [26]: scholarly article, job posting, and music playlist.…”
Section: Methods
confidence: 99%
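The Pylon statement above describes selecting topic-specific tables from GitTables to assemble a unionability benchmark. The sketch below is a minimal illustration of that kind of selection, assuming the GitTables CSV files have already been downloaded and unpacked into per-topic directories; the directory layout, folder names, and the load_topic_tables helper are hypothetical and are not part of the GitTables or Pylon code.

```python
import glob
import os
import pandas as pd

# Assumption: a local copy of GitTables CSV files, grouped into one folder per topic.
GITTABLES_DIR = "gittables/"
TOPICS = ["scholarly_article", "job_posting", "music_playlist"]  # topics named in the citation


def load_topic_tables(topic, limit=None):
    """Load the CSV tables filed under one Schema.org topic directory."""
    paths = sorted(glob.glob(os.path.join(GITTABLES_DIR, topic, "*.csv")))
    if limit is not None:
        paths = paths[:limit]
    tables = []
    for path in paths:
        try:
            tables.append(pd.read_csv(path, on_bad_lines="skip"))
        except (pd.errors.ParserError, UnicodeDecodeError):
            continue  # skip tables that fail to parse cleanly
    return tables


# Collect tables per topic and report how many were loaded for each.
benchmark = {topic: load_topic_tables(topic) for topic in TOPICS}
print({topic: len(tables) for topic, tables in benchmark.items()})
```

This only groups raw tables by topic; the actual Pylon benchmark additionally curates unionable subsets within each topic, which is not reproduced here.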
“…We believe that this gap can be attributed to the datasets used to pretrain these models, which mainly represent tables from the Web. Such tables can only partially represent tables found in enterprise databases [18,25,43]. This affects the applicability of concurrent pretrained table models to downstream tasks on typical "offline" databases.…”
Section: Unrepresentative Training Data
confidence: 99%
“…Unlike large corpora of text extracted from the Web which are shown to be instrumental for pretraining widely used language models [3,9], pretrained table models have shown less impact in this regard. In fact, the generalizability of models trained towards typical database tables is found to be limited [18,25].…”
Section: Relevant Training Data
confidence: 99%