2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) 2017
DOI: 10.1109/icdar.2017.48

Enhancing Table of Contents Extraction by System Aggregation

Abstract: OCR-ed books usually lack logical structure information, such as chapters and sections. To enrich the navigation experience of users, several approaches have been proposed to extract the table of contents (ToC) from digitised books. In this paper, we introduce an aggregation-based method to enhance ToC extraction using system submissions from the ICDAR Book Structure Extraction competitions (2009, 2011, and 2013). Our experimental results show that the union of the two best approaches outperforms the existing approac…
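
As a rough illustration of what aggregating two systems' ToC outputs with set operators could look like, here is a minimal Python sketch. It is not the authors' implementation: the `TocEntry` type, the `aggregate` function, and the matching rule (same page number and same title after case and whitespace normalisation) are all assumptions made for this example, since the abstract does not specify how entries are compared.

```python
from typing import Iterable, List, NamedTuple


class TocEntry(NamedTuple):
    """Hypothetical ToC entry: a heading title and the page it starts on."""
    title: str
    page: int


def _key(entry: TocEntry) -> tuple:
    # Assumed matching rule: two entries are "the same" when they share a
    # page and a title that is equal after lowercasing and collapsing spaces.
    return (" ".join(entry.title.lower().split()), entry.page)


def aggregate(system_a: Iterable[TocEntry],
              system_b: Iterable[TocEntry],
              mode: str = "union") -> List[TocEntry]:
    """Combine two systems' ToC entries with a set operator."""
    keyed_a = {_key(e): e for e in system_a}
    keyed_b = {_key(e): e for e in system_b}
    if mode == "union":
        # Keep every entry found by either system; on a match, keep A's text.
        merged = {**keyed_b, **keyed_a}
    elif mode == "intersection":
        # Keep only entries that both systems agree on.
        merged = {k: e for k, e in keyed_a.items() if k in keyed_b}
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Return entries in reading order.
    return sorted(merged.values(), key=lambda e: (e.page, e.title))


a = [TocEntry("Chapter I", 5), TocEntry("Chapter II", 20)]
b = [TocEntry("chapter  i", 5), TocEntry("Chapter III", 41)]
union_result = aggregate(a, b, mode="union")
intersection_result = aggregate(a, b, mode="intersection")
```

In this sketch the union favours recall (any entry found by either system is kept), while the intersection favours precision (only entries both systems agree on are kept).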

Cited by 8 publications (4 citation statements)
References 14 publications (19 reference statements)
“…Recent algorithms have explored TOC extraction by parsing TOC pages and extract the hierarchical structure of sections and subsections. Most methods in this area have been developed in the context of the INEX [20] and ICDAR competitions [21][22][23] which, as we have mentioned before, focus on long and old digitised historical books, as opposed to short scientific articles with previous methods. To the best of our knowledge, the only work led outside these competitions on the topic of TOC page parsing is [24,25], who apply a rule-based approach to PDF document layout analysis.…”
Section: Toc Extraction Methods (mentioning)
confidence: 99%
“…The first one focuses on reconstructing only part of the structure information in one document, such as the table of contents (ToC) (Wu, Mitra, and Giles 2013; Tuarob, Mitra, and Giles 2015; Nguyen, Doucet, and Coustaty 2017; Bentabet et al 2020). Tuarob, Mitra, and Giles listed several position-related rules and used a random forest algorithm to determine whether one text line is a section name or not.…”
Section: Document Structure Reconstruction (mentioning)
confidence: 99%
“…Although the existing OCR applications extract simple structures such as (Karpinski and Belaïd, 2016) paragraphs and pages, it is harder to identify complex formatting such as chapters, sections. Mapping such information to a table of content to provide more structural information is proposed by (Nguyen et al, 2017). It introduced an aggregation-based method on two set operators and properties of the table of content entries.…”
Section: Derive Concept Relations (mentioning)
confidence: 99%