Enhancing Table of Contents Extraction by System Aggregation

Nguyen, Tri-Thanh; Doucet, Antoine; Coustaty, Mickaël

doi:10.1109/icdar.2017.48

Cited by 8 publications

(4 citation statements)

References 14 publications

(19 reference statements)

“…Recent algorithms have explored TOC extraction by parsing TOC pages and extracting the hierarchical structure of sections and subsections. Most methods in this area have been developed in the context of the INEX [20] and ICDAR competitions [21][22][23], which, as we have mentioned before, focus on long and old digitised historical books, as opposed to the short scientific articles targeted by previous methods. To the best of our knowledge, the only work on TOC page parsing carried out outside these competitions is [24,25], which applies a rule-based approach to PDF document layout analysis.…”

Section: TOC Extraction Methods (mentioning)

confidence: 99%

Automatic Table-of-Contents Generation for Efficient Information Access

et al. 2020

Purpose: This paper presents a novel neural approach, applicable to any searchable PDF document, that first detects the titles and then orders them hierarchically using sequence labelling to automatically generate the Table of Contents (TOC). A TOC signals the main divisions and subdivisions of a document to assist with navigation and information localisation. Methods: Unlike previous methods, we do not assume the presence of parsable TOC pages in the document but infer the TOC from a data-driven analysis of section titles, their order and their depth. Results: We offer an exhaustive analysis of the proposed model and evaluate it on French and English using documents from the financial domain, which we release to increase the community's interest. We compare this model to state-of-the-art approaches and show its superiority in multiple experiments. Conclusions: The approach described in this paper can easily be adapted to other domains and documents, and its application to the analysis of financial prospectuses will be strengthened by the release of datasets. The TOC generation algorithms used in this paper obtain state-of-the-art results and provide strong baselines for future work.

“…The first one focuses on reconstructing only part of the structure information in one document, such as the table of contents (ToC) (Wu, Mitra, and Giles 2013; Tuarob, Mitra, and Giles 2015; Nguyen, Doucet, and Coustaty 2017; Bentabet et al. 2020). Tuarob, Mitra, and Giles listed several position-related rules and used a random forest algorithm to determine whether one text line is a section name or not.…”

Section: Document Structure Reconstruction (mentioning)

confidence: 99%

HRDoc: Dataset and Baseline Method toward Hierarchical Reconstruction of Document Structures

et al. 2023

AAAI

The problem of document structure reconstruction refers to converting digital or scanned documents into corresponding semantic structures. Most existing works mainly focus on splitting the boundary of each element in a single document page, neglecting the reconstruction of semantic structure in multi-page documents. This paper introduces hierarchical reconstruction of document structures as a novel task suitable for NLP and CV fields. To better evaluate the system performance on the new task, we built a large-scale dataset named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million semantic units. Every document in HRDoc has line-level annotations including categories and relations obtained from rule-based extractors and human annotators. Moreover, we proposed an encoder-decoder-based hierarchical document structure parsing system (DSPS) to tackle this problem. By adopting a multi-modal bidirectional encoder and a structure-aware GRU decoder with soft-mask operation, the DSPS model surpasses the baseline method by a large margin. All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc.

“…Although the existing OCR applications extract simple structures such as paragraphs and pages (Karpinski and Belaïd, 2016), it is harder to identify complex formatting such as chapters and sections. Mapping such information to a table of contents to provide more structural information is proposed by (Nguyen et al., 2017). It introduced an aggregation-based method built on two set operators and properties of the table of contents entries.…”

Section: Derive Concept Relations (mentioning)

confidence: 99%

A Review on Document Information Extraction Approaches

Silva and Silva 2021

Proceedings of the Student Research Workshop Associated With RANLP 2021

Information extraction from documents has become widely used in emerging areas of natural language processing. Most entity extraction methodologies vary with the context, such as the medical or financial domain, and some are even limited to a given language. It is better to have one generic approach applicable to any document type that extracts entity information regardless of language, context, and structure. Another issue in such research is structural analysis that preserves the hierarchical, semantic, and heuristic features. A further problem is that these methods usually require a massive training corpus. Therefore, this research focuses on mitigating such barriers. Several approaches have been identified for building document information extractors across different disciplines. This research area involves natural language processing, semantic analysis, information extraction, and conceptual modelling. This paper presents a review of information extraction mechanisms in order to construct a generic framework for document extraction, with the aim of providing a solid base for upcoming research.
