“…PDF is a widely used document format in digital libraries because it preserves the appearance of the original document. Although a considerable amount of research over the past two decades has sought to discover document layout by converting PDFs to other file types (e.g., image, HTML, text), automatically identifying logical structure information (e.g., words, text lines, paragraphs) and extracting document components (e.g., figures, tables, mathematical formulas) together with their content [2] remains a challenging problem. The major reasons are as follows: 1) the structural information is not explicitly marked up, because of the untagged nature of the PDF format; 2) the text sequences produced by existing PDF-to-text tools are often messy; 3) converting PDFs into other media (e.g., images) requires tools such as OCR, which can introduce new noise.…”