Extraction of bibliography information based on image of book cover

Yang, Hua; Onda, Norikazu; Kashimura, Masaaki; Ozawa, Shogo

doi:10.1109/iciap.1999.797713

Cited by 3 publications

References 14 publications

Extracting bibliographical data for PDF documents with HMM and external resources

Hsiao

Chang

Thomas

2014

Program

Purpose – This paper proposes an automatic metadata extraction and retrieval system that extracts bibliographical information from digital academic documents in Portable Document Format (PDF).

Design/methodology/approach – The authors use PDFBox to extract text and font size information, a rule-based method to identify titles, and a hidden Markov model (HMM) to extract the titles and authors. The extracted titles and authors (possibly incorrect or incomplete) are then sent as query strings to digital libraries (e.g. ACM, IEEE, CiteSeerX, SDOS, and Google Scholar) to retrieve the rest of the metadata.

Findings – Four experiments examine the feasibility of the proposed system. The first compares two HMM models: a multi-state model and a one-state model (the proposed model). The results show that the one-state model performs comparably to the multi-state model but is better suited to dealing with real-world unknown states. The second shows that the proposed model (without the aid of online queries) performs as well as other researchers' models on the Cora paper header dataset. The third examines the system's performance on a small dataset of 43 real PDF research papers; the results show that the proposed system (with online queries) performs well on bibliographical data extraction and even outperforms the free citation management tool Zotero 3.0. The fourth uses a larger dataset of 103 papers to compare the system with Zotero 4.0; the results show that the system significantly outperforms Zotero 4.0. The feasibility of the proposed model is thus justified.

Research limitations/implications – Academically, the system is unique in two respects. First, it uses only the Cora header set for HMM training, without other tagged datasets or gazetteer resources, which makes the system light and scalable. Second, the system is workable and can be applied to extracting metadata from real-world PDF files. The extracted bibliographical data can then be imported into citation software such as EndNote or RefWorks to increase researchers' productivity.

Practical implications – In practice, the system can outperform an existing tool, Zotero 4.0. This gives practitioners a good opportunity to develop similar products for real applications, though doing so may require some knowledge of HMM implementation.

Originality/value – The HMM implementation itself is not novel. What is innovative is that it combines two HMM models: the main model is adapted from Freitag and McCallum (1999), and the authors add the word features of the Nymble HMM (Bikel et al., 1997) to it. The system works even without manually tagging datasets before training the model (the authors train on the Cora dataset and test on real-world PDF papers), which differs significantly from other work so far. The experimental results provide sufficient evidence of the feasibility of the proposed method in this respect.
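The abstract above mentions a rule-based step that identifies titles from text and font size information. The abstract does not state the actual rule, so the following is only a minimal sketch of one plausible heuristic: treat the first run of consecutive lines set in the largest font as the title. The function name `guess_title`, the `tolerance` parameter, and the sample data are hypothetical; in practice the (text, font size) pairs would come from a PDF text extractor such as PDFBox.

```python
from typing import List, Tuple

# Each line is (text, font_size). The values below are made up for
# illustration; they are not taken from the paper.
Line = Tuple[str, float]


def guess_title(lines: List[Line], tolerance: float = 0.5) -> str:
    """Return the first run of consecutive lines set in the largest font.

    This is an assumed heuristic, not the rule used by the authors.
    """
    if not lines:
        return ""
    max_size = max(size for _, size in lines)
    title_parts: List[str] = []
    for text, size in lines:
        if max_size - size <= tolerance:
            title_parts.append(text.strip())
        elif title_parts:
            break  # the largest-font run has ended
    return " ".join(title_parts)


first_page = [
    ("Program", 9.0),
    ("Extracting bibliographical data for PDF documents", 18.0),
    ("with HMM and external resources", 18.0),
    ("Hsiao, Chang, Thomas", 11.0),
    ("Purpose - The purpose of this paper is ...", 10.0),
]
print(guess_title(first_page))
```

A heuristic like this can only propose a candidate. The abstract describes the extracted titles as possibly incorrect or incomplete, which is why the described system follows up with an HMM and online queries.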

Recognizing Call Numbers for Library Books: The Problem and Issues

Pham

Dvorak

Pérez

2018

2018 International Conference on Computational Science and Computational Intelligence (CSCI)

Complex document image segmentation using localized histogram analysis with multi-layer matching and clustering

Chen

Chiu

2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583)

This paper proposes a new segmentation method to separate the text from various complex document images. An automatic multilevel thresholding method, based on discriminant analysis, is utilized to recursively segment a specified block region into several layered image sub-blocks. Then the multi-layer region-based clustering method is performed to process the layered image sub-blocks to form several object layers. Hence character strings with different illuminations, non-text objects and background components are segmented into separate object layers. After performing the text extraction process, the text objects with different sizes, styles and illuminations are properly extracted. Experimental results on the extraction of text strings from complex document images demonstrate the effectiveness of the proposed region-based segmentation method.
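The recursive, discriminant-analysis-based thresholding described above can be illustrated with Otsu's method, the classic threshold that maximizes between-class variance. Whether the paper uses exactly this formulation is an assumption. The sketch below, with hypothetical function names and a made-up `depth` limit, splits a list of gray levels with Otsu's threshold and then re-applies it to each side.

```python
from typing import List


def otsu_threshold(pixels: List[int], levels: int = 256) -> int:
    """Return t maximizing between-class variance; class 0 is pixels <= t."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in class 0
    sum0 = 0  # sum of intensities in class 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # class 0 still empty
        w1 = total - w0
        if w1 == 0:
            break     # class 1 empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def multilevel_thresholds(pixels: List[int], depth: int = 2) -> List[int]:
    """Recursively apply Otsu's threshold; return thresholds in ascending order."""
    if depth == 0 or len(set(pixels)) < 2:
        return []
    t = otsu_threshold(pixels)
    low = [p for p in pixels if p <= t]
    high = [p for p in pixels if p > t]
    return (multilevel_thresholds(low, depth - 1) + [t]
            + multilevel_thresholds(high, depth - 1))


# Synthetic block with three gray-level populations (made-up data).
block = [20] * 40 + [120] * 40 + [230] * 40
print(multilevel_thresholds(block, depth=2))
```

Each returned threshold separates one intensity layer from the next. The clustering of layered sub-blocks into object layers that the abstract describes is a separate step and is not sketched here.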
