Fuse: A Reproducible, Extendable, Internet-Scale Corpus of Spreadsheets

Barik, Titus; Lubick, Kevin; Smith, Justin; Slankas, John; Murphy-Hill, Emerson

doi:10.1109/msr.2015.70

Cited by 25 publications

(28 citation statements)

References 9 publications

“…This corpus is of a particular interest, since it provides access to real-world business spreadsheets used in industry. The third corpus is FUSE [3] that contains 249,376 unique spreadsheets, extracted from Common Crawl.…”

Section: Dataset Of Annotated Tables

mentioning

confidence: 99%

Table Identification and Reconstruction in Spreadsheets

Koci, Thiele, Romero, et al. 2017

Advanced Information Systems Engineering

Abstract. Spreadsheets are one of the most successful content generation tools, used in almost every enterprise to perform data transformation, visualization, and analysis. The high degree of freedom provided by these tools results in very complex sheets, intermingling the actual data with formatting, formulas, layout artifacts, and textual metadata. To unlock the wealth of data contained in spreadsheets, a human analyst will often have to understand and transform the data manually. To overcome this cumbersome process, we propose a framework that is able to automatically infer the structure and extract the data from these documents in a canonical form. In this paper, we describe our heuristics-based method for discovering tables in spreadsheets, given that each cell is classified as either header, attribute, metadata, data, or derived. Experimental results on a real-world dataset of 439 worksheets (858 tables) show that our approach is feasible and effectively identifies tables within partially structured spreadsheets.

“…To evaluate the grammar, we attempt to parse a total of 8 577 426 unique formulas. These originate from the 3 major datasets available in the spreadsheet research community, the EUSES dataset, published in 2005 and consisting of 4498 spreadsheets, the Enron email corpus, which became available after the Enron company declared bankruptcy in 2001, consisting of 16 190 spreadsheets, and the recently published FUSE corpus, consisting of 249 376 spreadsheets, along with a fourth dataset of 109 475 spreadsheets that we accumulated through crawling the WikiLeaks website. The original spreadsheets in the datasets are of various Excel versions.…”

Section: Evaluation and Dataset

mentioning

confidence: 99%

“…We identified and examined 2 existing grammars: the official, published grammar for Excel formulas [21] and the grammar implemented by the formula parser of the Apache POI Java API for Microsoft Documents, and found that neither of them fulfills those requirements. In this paper, we present a grammar that can support research on spreadsheet formulas. We further use the grammar to analyze more than 8 million unique formulas originating from the 3 major datasets available in the spreadsheet research community, namely, EUSES [17], Enron [18], and FUSE [19], along with a fourth dataset of that we accumulated through crawling the WikiLeaks website. The goal of the analysis is to obtain an understanding of how people program in the spreadsheets formula language by quantitatively evaluating the characteristics of spreadsheet formulas in terms of complexity, functionality, and data utilization.…”

mentioning

confidence: 99%

Parsing Excel formulas: A grammar and its application on 4 large datasets

Aivaloglou, Hoepelman, Hermans 2017

J Software Evolu Process

Spreadsheets are popular end user programming tools, especially in the industrial world. This makes them interesting research targets. However, there does not exist a reliable grammar that is concise enough to facilitate formula parsing and analysis and to support research on spreadsheet codebases. This paper presents a grammar for spreadsheet formulas that can successfully parse 99.99% of more than 8 million unique formulas extracted from 4 spreadsheet datasets. Our grammar is compatible with the spreadsheet formula language, recognizes the spreadsheet formula elements that are required for supporting spreadsheets research, and produces parse trees aimed at further manipulation and analysis. Additionally, we use the grammar to analyze the characteristics of the formulas of the 4 datasets in 3 different dimensions: complexity, functionality, and data utilization. Our results show that (1) most Excel formulas are simple, however formulas with more than 50 functions or operations exist, (2) almost all formulas use data from other cells, which is often not local, and (3) a surprising number of referring mechanisms are used by less than 1% of the formulas.

“…This corpus is of particular interest, since it provides access to real-world business spreadsheets used in industry. The third corpus is Fuse (Barik et al., 2015) that contains 249,376 unique spreadsheets, extracted from Common Crawl. Each spreadsheet in Fuse is accompanied by a JSON file that contains metadata and statistics.…”

Section: Spreadsheet Corpora and Training Data

mentioning

confidence: 99%

A Machine Learning Approach for Layout Inference in Spreadsheets

Koci, Thiele, Romero, et al. 2016

Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management

Abstract: Spreadsheet applications are one of the most used tools for content generation and presentation in industry and the Web. In spite of this success, there does not exist a comprehensive approach to automatically extract and reuse the richness of data maintained in this format. The biggest obstacle is the lack of awareness about the structure of the data in spreadsheets, which otherwise could provide the means to automatically understand and extract knowledge from these files. In this paper, we propose a classification approach to discover the layout of tables in spreadsheets. Therefore, we focus on the cell level, considering a wide range of features not covered before by related work. We evaluated the performance of our classifiers on a large dataset covering three different corpora from various domains. Finally, our work includes a novel technique for detecting and repairing incorrectly classified cells in a post-processing step. The experimental results show that our approach delivers very high accuracy bringing us a crucial step closer towards automatic table extraction.
