2004
DOI: 10.1007/978-3-540-28640-0_20

Layout and Content Extraction for PDF Documents

Abstract: Portable document format (PDF) is a common output format for electronic documents. Most PDF documents are untagged and do not have basic high-level document logical structural information, which makes the reuse or modification of the documents difficult. We developed techniques that identified logical components on a PDF document page. The outlines, style attributes and the contents of the logical components were extracted and expressed in an XML format. These techniques could facilitate the reuse an…

Cited by 73 publications (58 citation statements)
References 14 publications

“…(Seki, 2007) analyses document structure for simultaneous management of information in documents from various formats (image, PDF, and HTML). Most other contributions aim at extracting (some part of) the document content by means of a syntactic parsing of the PDF (Futrelle, 2003; Chao, 2003; Ramel, 2003) or at discovering the background by means of statistical analysis applied to the numerical features of the documents and its components (Chao, 2004).…”
Section: Related Work (mentioning)
confidence: 99%
“…Chao et al [2] reported their work on extract the layout and content from PDF documents. Hadjar et al have developed a tool for extracting the structures from PDF documents.…”
Section: Related Work On Table Detection (mentioning)
confidence: 99%
“…PDF is a widely used document format in digital libraries because it can preserve the appearance of the original document. Although a good number of researches have been done to discover the document layout by converting the PDFs to other types of files (e.g., image, html, text) in the past two decades, automatically identifying the document logical structures information (e.g., words, text lines, paragraphs, etc) and extracting the document components (e.g., figures, tables, mathematical formulas, etc) as well as the content [2] are still a challenging problem. The major reasons are as follows: 1) the structural information is not explicitly marked up because of the un-tagged nature of PDF format; 2) the text sequences are often messily generated by the existing PDF-to-text tools; 3) new noises can be generated by some necessary tools (e.g., OCR), if converting the PDFs into other media (e.g., image).…”
Section: Introduction (mentioning)
confidence: 99%
“…Although many researchers analyze PDF documents by converting them to other formats (e.g., image, html), automatically identifying the PDF document logical structures information and document components (e.g., figures, tables, etc) are still challenging problems [2] because of three main reasons: 1) extracted texts from PDF files are non-tagged; 2) wrong text sequences are generated by the text extraction tools; 3) new noises are caused by necessary tools (e.g., OCR) when converting the PDFs into other format (e.g., image).…”
Section: Introduction (mentioning)
confidence: 99%