2015 IEEE/ACM 12th Working Conference on Mining Software Repositories 2015
DOI: 10.1109/msr.2015.70

Fuse: A Reproducible, Extendable, Internet-Scale Corpus of Spreadsheets

Cited by 25 publications (28 citation statements)
References 9 publications
“…This corpus is of a particular interest, since it provides access to real-world business spreadsheets used in industry. The third corpus is FUSE [3] that contains 249,376 unique spreadsheets, extracted from Common Crawl.…”
Section: Dataset Of Annotated Tables
mentioning
confidence: 99%
“…To evaluate the grammar, we attempt to parse a total of 8 577 426 unique formulas. These originate from the 3 major datasets available in the spreadsheet research community, the EUSES dataset, published in 2005 and consisting of 4498 spreadsheets, the Enron email corpus, which became available after the Enron company declared bankruptcy in 2001, consisting of 16 190 spreadsheets, and the recently published FUSE corpus, consisting of 249 376 spreadsheets, along with a fourth dataset of 109 475 spreadsheets that we accumulated through crawling the WikiLeaks website. The original spreadsheets in the datasets are of various Excel versions.…”
Section: Evaluation and Dataset
mentioning
confidence: 99%
“…We identified and examined 2 existing grammars: the official, published grammar for Excel formulas [21] and the grammar implemented by the formula parser of the Apache POI Java API for Microsoft Documents, and found that neither of them fulfills those requirements. In this paper, we present a grammar that can support research on spreadsheet formulas. We further use the grammar to analyze more than 8 million unique formulas originating from the 3 major datasets available in the spreadsheet research community, namely, EUSES [17], Enron [18], and FUSE [19], along with a fourth dataset that we accumulated through crawling the WikiLeaks website. The goal of the analysis is to obtain an understanding of how people program in the spreadsheets formula language by quantitatively evaluating the characteristics of spreadsheet formulas in terms of complexity, functionality, and data utilization.…”
mentioning
confidence: 99%
“…This corpus is of particular interest, since it provides access to real-world business spreadsheets used in industry. The third corpus is Fuse (Barik et al., 2015) that contains 249,376 unique spreadsheets, extracted from Common Crawl. Each spreadsheet in Fuse is accompanied by a JSON file that contains metadata and statistics.…”
Section: Spreadsheet Corpora and Training Data
mentioning
confidence: 99%