Proceedings of the 18th Annual International Conference on Digital Government Research 2017
DOI: 10.1145/3085228.3085278

Unleashing Tabular Content to Open Data

Cited by 24 publications (14 citation statements)
References 18 publications
“…Due to the popularity of the PDF format, PDF annotation has received considerable attention (e.g., [18,46,52]). PDF documents are widely used among various domains, for example, in government data [13], legal documents [30], patents [5] and product datasheets [54]. However, the PDF format hinders access and reuse of the data presented within the documents [14].…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…Tools that specifically focus on table extraction from PDF files use segmentation techniques to estimate the position of rows and columns [7]. Corrêa et al did a literature survey on table extraction tools [3]. They concluded that Tabula 4 is the most suitable open-source tool.…”
Section: Related Work (mentioning)
confidence: 99%
“…For example, challenges related to publishing semantic open government data are similar to the challenges in our research. This includes extracting data from legacy documents, often in PDF format [2,3]. Furthermore, in the literature use cases are described on publishing unstructured data as semantic data (e.g., [9,20,27]).…”
Section: Related Work (mentioning)
confidence: 99%
“…Another aspect to consider is the type of document. For example, Correa and Zander [7] analyzed a group of methods and tools for extracting tabular content from PDF files based on two main characteristics, ease of use and output results, and categorized the tools as theoretical proposals, free, or commercial. In [8], several heuristics were developed that together recognize and decompose tables in PDF files and store the extracted data in a structured data format (XML) to facilitate its use; these heuristics fall into two groups: table recognition and table decomposition.…”
Section: Introduction (unclassified)