Layout and Content Extraction for PDF Documents

Chao, Hui; Fan, Jian

doi:10.1007/978-3-540-28640-0_20

Cited by 73 publications

(58 citation statements)

References 14 publications


“…(Seki, 2007) analyses document structure for simultaneous management of information in documents from various formats (image, PDF, and HTML). Most other contributions aim at extracting (some part of) the document content by means of a syntactic parsing of the PDF (Futrelle, 2003; Chao, 2003; Ramel, 2003) or at discovering the background by means of statistical analysis applied to the numerical features of the documents and its components (Chao, 2004).…”

Section: Related Work (mentioning)

confidence: 99%

FOL Learning for Knowledge Discovery in Documents

et al. 2012


This chapter proposes the application of machine learning techniques, based on first-order logic as a representation language, to the real-world application domain of document processing. First, the tasks and problems involved in document processing are presented, along with the prototypical system DOMINUS and its architecture, whose components are aimed at facing these issues. Then, a closer look is provided for the learning component of the system, and the two sub-systems that are in charge of performing supervised and unsupervised learning as a support to the system performance. Finally, some experiments are reported that assess the quality of the learning performance. This is intended to prove to researchers and practitioners of the field that first-order logic learning can be a viable solution to tackle the domain complexity, and to solve problems such as incremental evolution of the document repository.


“…Chao et al [2] reported their work on extract the layout and content from PDF documents. Hadjar et al have developed a tool for extracting the structures from PDF documents.…”

Section: Related Work On Table Detection (mentioning)

confidence: 99%

“…PDF is a widely used document format in digital libraries because it can preserve the appearance of the original document. Although a good number of researches have been done to discover the document layout by converting the PDFs to other types of files (e.g., image, html, text) in the past two decades, automatically identifying the document logical structures information (e.g., words, text lines, paragraphs, etc) and extracting the document components (e.g., figures, tables, mathematical formulas, etc) as well as the content [2] are still a challenging problem. The major reasons are as follows: 1) the structural information is not explicitly marked up because of the un-tagged nature of PDF format; 2) the text sequences are often messily generated by the existing PDF-to-text tools; 3) new noises can be generated by some necessary tools (e.g., OCR), if converting the PDFs into other media (e.g., image).…”

Section: Introduction (mentioning)

confidence: 99%

Identifying table boundaries in digital documents via sparse line detection

Liu, Mitra, Giles 2008

Proceedings of the 17th ACM Conference on Information and Knowledge Management


Most prior work on information extraction has focused on extracting information from text in digital documents. However, often, the most important information being reported in an article is presented in tabular form in a digital document. If the data reported in tables can be extracted and stored in a database, the data can be queried and joined with other data using database management systems. In order to prepare the data source for


“…Although many researchers analyze PDF documents by converting them to other formats (e.g., image, html), automatically identifying the PDF document logical structures information and document components (e.g., figures, tables, etc) are still challenging problems [2] because of three main reasons: 1) extracted texts from PDF files are non-tagged; 2) wrong text sequences are generated by the text extraction tools; 3) new noises are caused by necessary tools (e.g., OCR) when converting the PDFs into other format (e.g., image).…”

Section: Introduction (mentioning)

confidence: 99%