Unleashing Tabular Content to Open Data

Corrêa, Andreiwid Sheffer; Zander, Pär-Ola

doi:10.1145/3085228.3085278

Cited by 24 publications

(14 citation statements)

References 18 publications

“…Due to the popularity of the PDF format, PDF annotation has received considerable attention (e.g., [18,46,52]). PDF documents are widely used among various domains, for example, in government data [13], legal documents [30], patents [5] and product datasheets [54]. However, the PDF format hinders access and reuse of the data presented within the documents [14].…”

Section: Background and Related Work (mentioning)

confidence: 99%

Crowdsourcing Scholarly Discourse Annotations

Oelen

Stocker

Auer

2021

26th International Conference on Intelligent User Interfaces

The number of scholarly publications grows steadily every year and it becomes harder to find, assess and compare scholarly knowledge effectively. Scholarly knowledge graphs have the potential to address these challenges. However, creating such graphs remains a complex task. We propose a method to crowdsource structured scholarly knowledge from paper authors with a web-based user interface supported by artificial intelligence. The interface enables authors to select key sentences for annotation. It integrates multiple machine learning algorithms to assist authors during the annotation, including class recommendation and key sentence highlighting. We envision that the interface is integrated in paper submission processes for which we define three main task requirements: The task has to be (1) straightforward (2) time efficient (3) well-defined. We evaluated the interface with a user study in which participants were assigned the task to annotate one of their own articles. With the resulting data, we determined whether the participants were successfully able to perform the task. Furthermore, we evaluated the interface's usability and the participant's attitude towards the interface with a survey. The results suggest that sentence annotation is a feasible task for researchers and that they do not object to annotate their articles during the submission process. CCS Concepts: Human-centered computing → Web-based interaction; Information systems → Web interfaces; Crowdsourcing.

“…Tools that specifically focus on table extraction from PDF files use segmentation techniques to estimate the position of rows and columns [7]. Corrêa et al did a literature survey on table extraction tools [3]. They concluded that Tabula 4 is the most suitable open-source tool.…”

Section: Related Work (mentioning)

confidence: 99%

“…For example, challenges related to publishing semantic open government data are similar to the challenges in our research. This includes extracting data from legacy documents, often in PDF format [2,3]. Furthermore, in the literature use cases are described on publishing unstructured data as semantic data (e.g., [9,20,27]).…”

Section: Related Work (mentioning)

confidence: 99%

Creating a Scholarly Knowledge Graph from Survey Article Tables

Oelen

Stocker

Auer

2020

Lecture Notes in Computer Science

Due to the lack of structure, scholarly knowledge remains hardly accessible for machines. Scholarly knowledge graphs have been proposed as a solution. Creating such a knowledge graph requires manual effort and domain experts, and is therefore time-consuming and cumbersome. In this work, we present a human-in-the-loop methodology used to build a scholarly knowledge graph leveraging literature survey articles. Survey articles often contain manually curated and high-quality tabular information that summarizes findings published in the scientific literature. Consequently, survey articles are an excellent resource for generating a scholarly knowledge graph. The presented methodology consists of five steps, in which tables and references are extracted from PDF articles, tables are formatted and finally ingested into the knowledge graph. To evaluate the methodology, 92 survey articles, containing 160 survey tables, have been imported in the graph. In total, 2 626 papers have been added to the knowledge graph using the presented methodology. The results demonstrate the feasibility of our approach, but also indicate that manual effort is required and thus underscore the important role of human experts.

“…Another aspect to consider is the type of document. For example, Correa and Zander [7] analyzed a group of methods and tools focused on extracting tabular content from PDF files, based on two main characteristics (ease of use and output results) and on a categorization of the tools as theoretical proposals, free, or commercial. In [8], several heuristics were developed that jointly recognize and decompose tables in PDF files and store the extracted data in a structured data format (XML) to facilitate its use; these heuristics fall into two groups: table recognition and table decomposition.…”

Section: Introduction (unclassified)

Algoritmos para el reconocimiento de estructuras de tablas

Escalona

2020

Ingenius

Tables are a very common way of organizing and publishing data. For example, the Web holds an enormous number of tables published in HTML, embedded in PDF documents, or simply downloadable from Web pages. However, tables are not always easy to interpret, since they have a wide variety of characteristics and are organized in different formats. In fact, a large number of methods and tools have been developed for table interpretation. This work presents the implementation of an algorithm, based on Conditional Random Fields (CRF), that classifies the rows of a table as header rows, data rows, or metadata rows. The implementation is complemented by two algorithms for recognizing tables in spreadsheets, one based on rules and one on region detection. Finally, the work describes the results and benefits obtained by applying the algorithm to HTML tables obtained from the Web and to tables in spreadsheet form downloaded from the website of the National Petroleum Agency of Brazil.
