Automatic document classification and indexing in high-volume applications

Appiani, E.; Cesarini, Francesca; Colla, Anna Maria; Diligenti, Michelangelo; Gori, Marco; Marinai, Simone; Soda, Giovanni

doi:10.1007/pl00010904

Cited by 32 publications

(21 citation statements)

References 12 publications


“…However, whereas RLSA filtering has proved its efficiency in segmenting textual documents, its use in graphics-rich documents is less frequent; one of the few methods we are aware of is that of Lu [13]. Other methods used for text-rich documents include those based on white streams [18] and the top-down methods using some kind of X-Y decomposition of the document [1,16].…”

Section: Introduction (mentioning)

confidence: 99%

Text/Graphics Separation Revisited

Tombre, Tabbone, Pélissier et al. 2002

Lecture Notes in Computer Science

113


Abstract. Text/graphics separation aims at segmenting the document into two layers: a layer assumed to contain text and a layer containing graphical objects. In this paper, we present a consolidation of a method proposed by Fletcher and Kasturi, with a number of improvements to make it more suitable for graphics-rich documents. We discuss the right choice of thresholds for this method, and their stability. We also propose a post-processing step for retrieving text components touching the graphics, through local segmentation of the distance skeleton.


“…The first one concerns data-based systems and the second one concerns model-based systems. Data-based systems are usually used in heterogeneous document flows and extract different information from documents, such as tables [26], graphical features such as logos and trademarks [27], or the general layout [23]. On the contrary, model-based systems are used in homogeneous document flows, where similar documents arrive generally one after the other [28][29][30][31].…”

Section: Document Image Similarity (mentioning)

confidence: 99%

“…The documents of the same class can share some interesting information such as the background color, the document layout, the position of the relevant information on the image, or metadata, such as the seller. Once one document is grouped into a class, a specific processing for extracting the desired information can be designed depending on these features [23]. A simple way to define a class consists in taking a representative image.…”

Section: Overview (mentioning)

confidence: 99%

“…In the literature of document image classification, different measures of similarity have been used. Appiani et al [23] design a criterion to compare the structural similarity between trees that represent the structure of a document. Shin and Doermann [24] use a similarity measure that considers spatial and layout structure.…”

Section: Document Image Similarity (mentioning)

confidence: 99%

“…However, many applications require to deal with a great variety of layouts, where relevant information is located in different positions. In this case, it is necessary to recognize the document layout and apply the appropriate reading strategy [23]. Several strategies have been proposed to achieve an accurate document classification based on the layout analysis and classification [1,4,5,[23][24][25].…”

Section: Document Image Similarity (mentioning)

confidence: 99%


Tsallis Mutual Information for Document Classification

Vila, Bardera, Feixas et al. 2011

Entropy


Mutual information is one of the most widely used measures for evaluating image similarity. In this paper, we investigate the application of three different Tsallis-based generalizations of mutual information to analyze the similarity between scanned documents. These three generalizations derive from the Kullback-Leibler distance, the difference between entropy and conditional entropy, and the Jensen-Tsallis divergence, respectively. In addition, the ratio between these measures and the Tsallis joint entropy is analyzed. The performance of all these measures is studied for different entropic indexes in the context of document classification and registration.
