ICDAR 2013 Competition on Book Structure Extraction

Doucet, Antoine; Kazai, Gabriella; Colutto, Sebastian; Mühlberger, Günter

doi:10.1109/icdar.2013.290

Cited by 19 publications

(22 citation statements)

References 15 publications

(18 reference statements)


“…Recent algorithms have explored TOC extraction by parsing TOC pages and extract the hierarchical structure of sections and subsections. Most methods in this area have been developed in the context of the INEX [20] and ICDAR competitions [21][22][23] which, as we have mentioned before, focus on long and old digitised historical books, as opposed to short scientific articles with previous methods. To the best of our knowledge, the only work led outside these competitions on the topic of TOC page parsing is [24,25], who apply a rule-based approach to PDF document layout analysis.…”

Section: Toc Extraction Methods (mentioning)

confidence: 99%


Automatic Table-of-Contents Generation for Efficient Information Access

et al. 2020


Purpose: This paper presents a novel neural-based approach, applicable to any searchable PDF document that first detects the titles and then hierarchically orders them using a sequence labelling approach to generate automatically the Table of Contents (TOC). A TOC signals the main divisions and subdivisions of a document to assist with navigation and information localisation. Methods: Unlike previous methods, we do not assume the presence of parsable TOC pages in the document but infer the TOC from a data-driven analysis of sections titles, their order and their depth. Results: We offer an exhaustive analysis of the proposed model and evaluate it on French and English using documents from the financial domain, which we release to increase community's interest. We compare this model to state-of-the-art approaches and show its superiority in multiple experiments. Conclusions: The approach described in this paper can easily be adapted to other domains and documents and its application to the analysis of financial prospectuses will be strengthened by the release of datasets. The TOC generation algorithms used in this paper obtain state-of-the-art results and provide strong baselines for future work.



“…Lastly, a number of methods have been proposed to detect titles using machine learning methods based on layout and text features. In such approaches, the list of titles are hierarchically ordered according to a predefined rule-based function [21,26,27].…”

Section: Toc Extraction Methods (mentioning)

confidence: 99%


“…Several approaches are meant to address the extraction of books' ToCs. They can be classified into 3 types, including approaches based on the detection of ToC pages, on the whole book content, and hybrid ones [4].…”

Section: Related Work (mentioning)

confidence: 99%

“…In this paper, we present an approach based on the aggregation of the existing approaches. We utilise the combination of two set operators (the union and the intersection) and two properties (title and page number) to aggregate submissions of the ICDAR book structure extraction competitions in 2009 [2], 2011 [3], and 2013 [4]. Our method is evaluated by the title-based and link-based measures over three book structure extraction competitions' datasets.…”

Section: Introduction (mentioning)

confidence: 99%

Enhancing Table of Contents Extraction by System Aggregation

Nguyen

Doucet

Coustaty

2017

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)

Self Cite


The OCR-ed books usually lack logical structure information, such as chapters, sections. To enrich the navigation experience of users, several approaches have been proposed to extract table of contents (ToC) from digitised books. In this paper, we introduce an aggregation-based method to enhance ToC extraction using system submissions from the ICDAR Book structure extraction competitions (2009, 2011, and 2013). Our experimental results show that the union of two best approaches outperforms the existing approaches using both the title-based and link-based evaluation measures on a dataset of more than 2000 books. By efficiently combining the results of existing systems in an unsupervised way, we consistently beat the state-of-the-art in book structure extraction, with performance improvements that are statistically significant.
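The aggregation idea described above (two set operators applied over two properties) can be illustrated with a minimal sketch. It is not the authors' implementation: the `TocEntry` type, the `aggregate` function, and the matching rule (case- and whitespace-insensitive title comparison, exact page-number comparison) are all assumptions made for illustration, since the abstract does not say how entries from different systems are matched.

```python
from typing import List, NamedTuple


class TocEntry(NamedTuple):
    """Hypothetical ToC entry: a title and the page it links to."""
    title: str
    page: int


def _key(entry: TocEntry, prop: str):
    """Return the value used to decide whether two entries match."""
    if prop == "title":
        # Assumed normalisation: lower-case and collapse whitespace.
        return " ".join(entry.title.lower().split())
    if prop == "page":
        return entry.page
    raise ValueError(f"unknown property: {prop!r}")


def aggregate(a: List[TocEntry], b: List[TocEntry],
              operator: str = "union", prop: str = "title") -> List[TocEntry]:
    """Combine two system outputs with a set operator over one property."""
    keys_b = {_key(e, prop) for e in b}
    if operator == "intersection":
        # Keep entries of system A that system B also found.
        merged = [e for e in a if _key(e, prop) in keys_b]
    elif operator == "union":
        # Keep all of system A, plus entries only system B found.
        keys_a = {_key(e, prop) for e in a}
        merged = list(a) + [e for e in b if _key(e, prop) not in keys_a]
    else:
        raise ValueError(f"unknown operator: {operator!r}")
    # Present the aggregated ToC in page order.
    return sorted(merged, key=lambda e: e.page)


system_a = [TocEntry("Chapter I", 5), TocEntry("Chapter II", 20)]
system_b = [TocEntry("chapter  i", 5), TocEntry("Chapter III", 41)]

union_by_title = aggregate(system_a, system_b, "union", "title")
common_by_page = aggregate(system_a, system_b, "intersection", "page")
```

In this toy input, the union keeps three entries (the two systems agree on "Chapter I"), while the intersection keeps only the entry both systems place on page 5. In general a union favours recall and an intersection favours precision.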


“…In addition to these task, the Structure Extraction (SE) task ran at ICDAR 2013 [3], with the aim of evaluating automatic techniques for deriving structure from OCR and building hyperlinked table of contents. The extracted structure could then be used to aid navigation inside the books.…”

Section: Aims and Tasks (mentioning)

confidence: 99%

Report on INEX 2013

Bellot

Doucet

Geva

et al. 2013

SIGIR Forum


INEX investigates focused retrieval from structured documents by providing large test collections of structured documents, uniform evaluation measures, and a forum for organizations to compare their results. This paper reports on the INEX 2013 evaluation campaign, which consisted of four activities addressing three themes: searching professional and user generated data (Social Book Search track); searching structured or semantic data (Linked Data track); and focused retrieval (Snippet Retrieval and Tweet Contextualization tracks). INEX 2013 was an exciting year for INEX in which we consolidated the collaboration with (other activities in) CLEF and for the second time ran our workshop as part of the CLEF labs in order to facilitate knowledge transfer between the evaluation forums. This paper gives an overview of all the INEX 2013 tracks, their aims and task, the built test-collections, and gives an initial analysis of the results.
