2009 10th International Conference on Document Analysis and Recognition 2009
DOI: 10.1109/icdar.2009.143

Analysis of book documents' table of content based on clustering

Abstract: Table of contents (TOC) recognition has attracted a great deal of attention in recent years. After reviewing the merits and drawbacks of the existing TOC recognition methods, we have observed that book documents are multi-page documents with intrinsic local format consistency. Based on this finding we introduce an automatic TOC analysis method through clustering. This method first detects the decorative elements in TOC pages. Then it learns a layout model used in the TOC pages through clustering. Finally, it g…

Cited by 13 publications (15 citation statements)
References 9 publications
“…The most direct originates in work that grows out of the INEX and ICDAR book structure extraction competitions (Kazai et al, 2010; Doucet et al, 2011; Doucet et al, 2013), in which participants are challenged to recognize the fine-grained structure present in documents (recognizing, for example, that the current article has sections entitled "Abstract," "Introduction," "Data," "References," etc.). The most successful systems recognize structure by parsing the table of contents (Meunier, 2005, 2009; Wu et al, 2013; Gao et al, 2009) rather than relying on the content of the book itself. Our work primarily differs in the fundamental design choice of prescribing a fixed set of categories into which we classify pages (in order to enable comparison across documents) rather than prioritizing the idiosyncratic structure of a book (which is useful for generating new tables of contents).…”
Section: Related Work (mentioning)
confidence: 99%
“…While other work has focused on extracting the idiosyncratic structure inherent in each book, such as recognizing chapter boundaries in order to automatically generate a table of contents, or link a parsed table of contents to positions in a book (Meunier, 2005, 2009; Wu et al, 2013; Gao et al, 2009), labeling document segments with a fixed typology has complementary benefits, allowing researchers to identify consistent categories in all books regardless of the names assigned by a specific author or publisher, or popular at a given time. At the same time, book structure labeling presents real challenges to automatic identification.…”
Section: Introduction (mentioning)
confidence: 99%
“…Accurate extraction of the table of content is a challenging task for book retrieval systems. ToC recognition was previously studied to enable inside book search and navigation [7,21]. However, they assume entries at the same level in a ToC share consistent features and each entry can be matched to the related title in the body part.…”
Section: Introduction (mentioning)
confidence: 99%
“…There has been an increase in interest in document analysis community to develop efficient algorithms for detecting ToC pages in a document [1]. The research related to ToC analysis can be divided into three areas [2]: ToC page detection, ToC parsing and to link the actual pages with these recognized parts.…”
mentioning
confidence: 99%
“…Belaïd [6] proposed a Part-of-Speech tagging (PoS) based algorithm for ToC detection. Gao et al [2] proposed a ToC parsing and decorative elements detection technique based on clustering. Mandal et al [7] used the spatial distribution properties of contents for detection of ToC pages.…”
mentioning
confidence: 99%