Book Layout Analysis: TOC Structure Extraction Engine

Dresevic, Bodin; Uzelac, Aleksandar; Radakovic, Bogdan; Todic, Nikola

doi:10.1007/978-3-642-03761-0_17

Cited by 14 publications

(7 citation statements)

References 1 publication


“…Recent algorithms have explored TOC extraction by parsing TOC pages and extract the hierarchical structure of sections and subsections. Most methods in this area have been developed in the context of the INEX [20] and ICDAR competitions [21][22][23] which, as we have mentioned before, focus on long and old digitised historical books, as opposed to short scientific articles with previous methods. To the best of our knowledge, the only work led outside these competitions on the topic of TOC page parsing is [24,25], who apply a rule-based approach to PDF document layout analysis.…”

Section: Toc Extraction Methods (mentioning)

confidence: 99%

Automatic Table-of-Contents Generation for Efficient Information Access

et al. 2020


Purpose: This paper presents a novel neural-based approach, applicable to any searchable PDF document that first detects the titles and then hierarchically orders them using a sequence labelling approach to generate automatically the Table of Contents (TOC). A TOC signals the main divisions and subdivisions of a document to assist with navigation and information localisation.

Methods: Unlike previous methods, we do not assume the presence of parsable TOC pages in the document but infer the TOC from a data-driven analysis of sections titles, their order and their depth.

Results: We offer an exhaustive analysis of the proposed model and evaluate it on French and English using documents from the financial domain, which we release to increase community's interest. We compare this model to state-of-the-art approaches and show its superiority in multiple experiments.

Conclusions: The approach described in this paper can easily be adapted to other domains and documents and its application to the analysis of financial prospectuses will be strengthened by the release of datasets. The TOC generation algorithms used in this paper obtain state-of-the-art results and provide strong baselines for future work.


“…Most of the approaches of the state of the art rely on the detection of ToC pages within the book, and their detailed analysis for listing all ToC entries and linking them to the corresponding pages. To extract ToC entries and link them to the right page, the most effective technique to date remains the one developed by Dresevic et al [13] which consists in recognizing ToC pages and then processing them so as to extract all ToC entries using a supervised method relying on pattern occurrences from an external training set. However, it is worth noting that the approach of Gander et al [16] performs better for the sole ToC entry extraction (not taking page-linking into account).…”

Section: Approaches Presented (mentioning)

confidence: 99%

Logical Structure Extraction from Digitized Books

Doucet

2018

Series in Machine Perception and Artificial Intelligence


Mass digitization projects, such as the Million Book Project, efforts of the Open Content Alliance, and the digitization work of Google, are converting whole libraries by digitizing books on an industrial scale [5]. The process involves the efficient photographing of books, page-by-page, and the conversion of the image of each page into searchable text through the use of optical character recognition (OCR) software. Current digitization and OCR technologies typically produce the full text of digitized books with only minimal structure information. Pages and paragraphs are usually identified and marked up in the OCR, but more sophisticated structures, such as chapters, sections, etc., are not recognized. In order to enable systems to provide users with richer browsing experiences, it is necessary to make such additional structures available, for example, in the form of XML markup embedded in the full text of the digitized books. The Book Structure Extraction competition aims to address this need by promoting research into automatic structure recognition and extraction techniques that could complement or enhance current OCR methods and …
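The citation statement for this entry describes processing ToC pages to extract entries and link them to pages. The following is only a rough illustration of that extraction step: a plain regular expression stands in for the supervised, pattern-based method attributed to Dresevic et al., whose details are not given on this page. The names `TOC_LINE` and `parse_toc_line` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed line shape: a title, optional dot leaders or spaces, then a page number.
# This is an illustrative stand-in, not the method of Dresevic et al.
TOC_LINE = re.compile(r"^\s*(?P<title>.*?\S)[\s.]+(?P<page>\d+)\s*$")


def parse_toc_line(line: str) -> Optional[Tuple[str, int]]:
    """Return (title, page) for a line that looks like a ToC entry, else None."""
    match = TOC_LINE.match(line)
    if match is None:
        return None
    return match.group("title"), int(match.group("page"))


if __name__ == "__main__":
    for raw in ["CONTENTS", "Chapter I. The Beginning ........ 12"]:
        print(repr(raw), "->", parse_toc_line(raw))
```

A real system would also have to cope with OCR noise, multi-line entries, and Roman-numeral page numbers, none of which this sketch handles.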


“…The state-of-the-art approach belongs to this type and is developed by Dresevic et al [5](MDCS). It also recognises TOC pages and assign each physical page with a logical page number.…”

Section: Related Work (mentioning)

confidence: 99%

Enhancing Table of Contents Extraction by System Aggregation

Nguyen, Doucet, Coustaty

2017

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)


The OCR-ed books usually lack logical structure information, such as chapters, sections. To enrich the navigation experience of users, several approaches have been proposed to extract table of contents (ToC) from digitised books. In this paper, we introduce an aggregation-based method to enhance ToC extraction using system submissions from the ICDAR Book structure extraction competitions (2009, 2011, and 2013). Our experimental results show that the union of two best approaches outperforms the existing approaches using both the title-based and link-based evaluation measures on a dataset of more than 2000 books. By efficiently combining the results of existing systems in an unsupervised way, we consistently beat the state-of-the-art in book structure extraction, with performance improvements that are statistically significant.
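The abstract above reports that the union of the two best systems' outputs outperforms each system alone. A minimal sketch of such a union is given below. It assumes each system emits ToC entries as (title, page, level) records and that two entries count as the same when their page and whitespace/case-normalized title agree; the abstract does not state the matching rule, so that rule and the names `TocEntry` and `union_toc` are assumptions made for illustration.

```python
from typing import Iterable, List, NamedTuple, Tuple


class TocEntry(NamedTuple):
    title: str
    page: int
    level: int


def _key(entry: TocEntry) -> Tuple[int, str]:
    # Assumed matching rule: same page and same title after
    # lowercasing and collapsing whitespace.
    return entry.page, " ".join(entry.title.lower().split())


def union_toc(primary: Iterable[TocEntry],
              secondary: Iterable[TocEntry]) -> List[TocEntry]:
    """Union of two systems' entries; on a match the primary entry is kept."""
    merged = {_key(e): e for e in primary}
    for entry in secondary:
        merged.setdefault(_key(entry), entry)
    # Order the merged entries by page, then by depth.
    return sorted(merged.values(), key=lambda e: (e.page, e.level))


if __name__ == "__main__":
    system_a = [TocEntry("Chapter I", 5, 1), TocEntry("Chapter II", 20, 1)]
    system_b = [TocEntry("chapter  i", 5, 1), TocEntry("Preface", 1, 1)]
    for entry in union_toc(system_a, system_b):
        print(entry)
```

Keeping the primary system's entry on a match is a design choice of this sketch: it lets the stronger system decide title wording and depth while the weaker one only contributes entries the stronger one missed.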
