Table Identification and Reconstruction in Spreadsheets

Koci, Elvis; Thiele, Maik; Romero, Óscar; Lehner, Wolfgang

doi:10.1007/978-3-319-59536-8_33

Cited by 21 publications

(17 citation statements)

References 12 publications

(12 reference statements)

“…There is a considerable number of works tackling layout inference and information extraction in spreadsheets. Recent publications propose approaches involving to some extent machine learning techniques, such as [2], [3], [4], [5], and [6]. Also, we find rule-based approaches, like [7].…”

Section: Related Work (mentioning)

confidence: 80%

Table Recognition in Spreadsheets via a Graph Representation

Koci, Thiele, Lehner et al. 2018

2018 13th IAPR International Workshop on Document Analysis Systems (DAS)

Self Cite

Spreadsheet software are very popular data management tools. Their ease of use and abundant functionalities equip novices and professionals alike with the means to generate, transform, analyze, and visualize data. As a result, spreadsheets are a great resource of factual and structured information. This accentuates the need to automatically understand and extract their contents. In this paper, we present a novel approach for recognizing tables in spreadsheets. Having inferred the layout role of the individual cells, we build layout regions. We encode the spatial interrelations between these regions using a graph representation. Based on this, we propose Remove and Conquer (RAC), an algorithm for table recognition that implements a list of carefully curated rules. An extensive experimental evaluation shows that our approach is viable. We achieve significant accuracy in a dataset of real spreadsheets from various domains.

“…In a similar fashion to [4], we then use the inferred roles to create the so-called layout regions (see Figure 1c). These group together adjacent cells having the same layout role.…”

Section: Introduction (mentioning)

confidence: 99%

Table Recognition in Spreadsheets via a Graph Representation

Koci, Thiele, Lehner et al. 2018

2018 13th IAPR International Workshop on Document Analysis Systems (DAS)

Self Cite
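
The grouping step described in the statement above (adjacent cells with the same layout role form one layout region) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `layout_regions`, the dictionary input, the role labels, and the choice of 4-neighbour adjacency are all assumptions made for illustration.

```python
from collections import deque


def layout_regions(roles):
    """Group adjacent cells that share a layout role (hypothetical helper).

    roles: dict mapping (row, col) -> role label, e.g. "header" or "data".
    Returns a list of (role, sorted list of cells), one entry per region.
    Adjacency is assumed to mean the four direct neighbours of a cell.
    """
    seen = set()
    regions = []
    for start in sorted(roles):
        if start in seen:
            continue
        role = roles[start]
        seen.add(start)
        queue = deque([start])
        cells = []
        # Breadth-first flood fill over same-role neighbours.
        while queue:
            r, c = queue.popleft()
            cells.append((r, c))
            for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if nb not in seen and roles.get(nb) == role:
                    seen.add(nb)
                    queue.append(nb)
        regions.append((role, sorted(cells)))
    return regions


# Toy sheet: a header row, two data rows, an empty row, then a lone header cell.
sheet = {
    (0, 0): "header", (0, 1): "header",
    (1, 0): "data", (1, 1): "data",
    (2, 0): "data", (2, 1): "data",
    (4, 0): "header",
}
regions = layout_regions(sheet)
```

In this toy sheet the two header cells in row 0 form one region, the four data cells form a second, and the isolated header cell in row 4 forms a third, because the empty row 3 breaks adjacency.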

“…We see recognition and information extraction in spreadsheets as a series of steps, which collectively form our processing pipeline, illustrated in Figure 1. Although we cover various aspects of automatic spreadsheet processing, our research focuses mainly on two crucial tasks: layout inference [13,15] and table identification [10][11][12][14]. Subsequently, we adapt approaches from related work, to extract the information from the detected tables.…”

Section: Processing Pipeline (mentioning)

confidence: 99%

“…We have proposed several approaches for table recognition in spreadsheets [10,12,14]. Initially we employed heuristic- and rule-based methods.…”

Section: Table Recognition (mentioning)

confidence: 99%

“…Recent attempts, such as Ideas in Excel and Explore in Google Sheets, aim at providing insights and recommendations to users (e.g., summary statistics and charts), based on background analysis of tabular data in the sheet. Other works [1,3,5,17], including ours [10][11][12][13][14][15], focus on integrating and extracting data from spreadsheets. One of the main concerns comes with data and knowledge being scattered in multiple spreadsheet files.…”

Section: Introduction (mentioning)

confidence: 99%

XLIndy

Koci, Kuban, Luettig et al. 2019

Proceedings of the ACM Symposium on Document Engineering 2019

Self Cite

Over the years, spreadsheets have established their presence in many domains, including business, government, and science. However, challenges arise due to spreadsheets being partially-structured and carrying implicit (visual and textual) information. This translates into a bottleneck, when it comes to automatic analysis and extraction of information. Therefore, we present XLIndy, a Microsoft Excel add-in with a machine learning back-end, written in Python. It showcases our novel methods for layout inference and table recognition in spreadsheets. For a selected task and method, users can visually inspect the results, change configurations, and compare different runs. This enables iterative fine-tuning. Additionally, users can manually revise the predicted layout and tables, and subsequently save them as annotations. The latter is used to measure performance and (re-)train classifiers. Finally, data in the recognized tables can be extracted for further processing. XLIndy supports several standard formats, such as CSV and JSON.

Demeter: An automatic framework for data migration in open data lakes

Kim, Han, Son et al. 2023

Softw Pract Exp

An open data lake stores various forms and types of open data, and there is an increasing demand to manage raw data in tables rather than files for efficient data exploration and analysis. In this paper, we investigate the data management of open data lakes and recognize the limitations of table migration and related problems. First, open data lakes have problems of preprocessing complexity, scale limitation, and platform dependency due to the traditional data management method and open data characteristics. Second, existing studies for table migration have problems of lack of scalability, migration incompleteness, and scale limitation. In this work, we present a novel automation framework, called Demeter, which solves three problems inherent in open data lakes by expanding automation. Specifically, it supports automating catalog collection and preprocessing tasks to solve preprocessing complexity and scale limitation. It also supports platform universality for representative data platforms through the automation of catalog analysis and detailed processing logic. Demeter then solves three problems in table migration by adopting Airbyte, an open‐source ELT platform, and by enhancing automation capability with the Airbyte manager. We verify that Demeter resolves all the problems above through extensive experiments and proves its scalability and universality. In addition, Demeter significantly outperforms CKAN by up to 508.5% in automation performance, up to 207.28% in processing time, and up to 917.17% in migration performance. These results indicate that Demeter is an excellent automation framework that increases the utilization of large‐scale open data and supports reliable Internet‐scale migration.
