Automated detection and segmentation of table of contents page from document images

Mandal, Sekhar; Chowdhury, Shyama Prosad; Das, Amit Kumar; Chanda, Bhabatosh

doi:10.1109/icdar.2003.1227697

Cited by 19 publications

(6 citation statements)

References 8 publications

“…Similar with the research of Tsuruoka et al (2001) and Mandal et al (2003) in English documents, Sun et al (2004) made use of the rules of indentation in TOCs of Chinese books, and developed an algorithm to digitalize TOCs of Chinese books based on OCR technology and indentation analysis. Gao et al (2010) noticed the style consistence phenomenon of TOCs of Chinese books, and put forward a Chinese book TOC recognition method by detecting decorative elements in the TOC based on clustering techniques.…”

Section: Review of Related Literature (mentioning)

confidence: 97%

A method for automatic analysis Table of Contents in Chinese books

Chen

2015

Library Hi Tech

Purpose – The purpose of this paper is to propose a novel method to analyze Table of Contents (TOC) in Chinese books automatically based on the hierarchy organization rules which gained by investigation. Design/methodology/approach – This paper analyzed the main literature in this field first, then hierarchy organization rules of Chinese book TOC were generated and the method parsing TOC automatically based on these rules was proposed. A prototype system implementing the method was also developed. The method was evaluated through processing a corpus on the prototype system, and the results were checked with calculation of precision and recall. Findings – The experiment result illustrated the superiority (extensive application, recall is 95.34 percent and precision is 94.44 percent) of the method. Practical implications – The result can help Chinese libraries deal with electronic texts from four aspects. First, it can be used to complement or enhance current digitization and optical character recognition methods and cut the financial and labor cost of Chinese libraries. Second, it can help libraries to keep information on indexing words as well as chapters, sections and subsections in Chinese book databases, which ensures easy retrieval and extract any intended portion as demanded by user. Third, it helps to enrich the services and then enhances the user experiences in Chinese libraries. Fourth, it improves the specification and policy of digitalizing Chinese books. Originality/value – The paper provided insight into the hierarchy organization of TOCs in Chinese books, the method based on the rules has extensive application than other methods. This method for Chinese book TOC automatic analysis is also as reference for English book TOC automatic analysis.

“…And according to the way the models are generated, those approaches can be classified into two types: rule-based and learning-based. For example, Mandal et al [1] proposed a method of detecting TOC pages in a document, relying on page number-related heuristics and working on page images. Tsuruoka et al [2] used the indentation and font size to extract structural elements such as chapters and sections in a book.…”

Section: Related Work (mentioning)

confidence: 99%

“…Previous works have mentioned very little on how to get the individual TOC entries from TOC pages or how to segment TOC pages into TOC entries. Only Mandal et al [1] proposed a method to process broken-in lines based on predefined TOC component models, which cannot adapt to various TOC styles. In this paper, we use clustering techniques to generate a matched TOC model based on "document intrinsic format consistency".…”

Section: TOC Parsing (mentioning)

confidence: 99%

Analysis of book documents' table of content based on clustering

Gao, Tang, Lin, et al.

2009

2009 10th International Conference on Document Analysis and Recognition

Table of contents (TOC) recognition has attracted a great deal of attention in recent years. After reviewing the merits and drawbacks of the existing TOC recognition methods, we have observed that book documents are multi-page documents with intrinsic local format consistency. Based on this finding we introduce an automatic TOC analysis method through clustering. This method first detects the decorative elements in TOC pages. Then it learns a layout model used in the TOC pages through clustering. Finally, it generates TOC entries and extracts their hierarchical structure under the guidance of the model. More specifically, broken lines are taken into account in the method. Experimental results show that this method achieves high accuracy and efficiency. In addition, this method has been successfully applied in a commercial E-book production software package.

“…Mandal et al [1] proposed to extract the TOC from the scanned documents. Their approach is primarily based on optical character recognition (OCR), page heuristics and related techniques.…”

Section: Related Work (mentioning)

confidence: 99%