Abstract. In this paper, the problem of classifying HTML documents into a hierarchy of categories is investigated in the context of a cooperative information repository, named WebClassII. The hierarchy of categories is involved in all aspects of automated document classification, namely feature extraction, learning, and classification of a new document. Innovative aspects of this work are: a) an experimental study on actual Web documents which can be associated with any node in the hierarchy; b) the feature sele…
“…By assuming that documents to be rejected have a low posterior probability for all categories, the problem can be reformulated in a different way, namely, how to define a threshold for the value taken by a naïve classifier. Details on the thresholding algorithm are reported in [5].…”
Section: The Classification Methods (citation type: mentioning, confidence: 99%)
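The rejection idea in the statement above can be illustrated with a minimal sketch. It assumes a single fixed threshold applied to the highest posterior probability; the actual thresholding algorithm is the one described in [5] and may differ. The function name `classify_with_rejection`, the `posteriors` mapping, and the threshold value are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional


def classify_with_rejection(
    posteriors: Dict[str, float], threshold: float
) -> Optional[str]:
    """Return the most probable category, or None to reject the document.

    `posteriors` maps each category name to the posterior probability
    assigned by a classifier (hypothetical input, not the WebClassII API).
    """
    if not posteriors:
        return None
    # Pick the category with the highest posterior probability.
    best_category = max(posteriors, key=posteriors.get)
    # Reject when even the best category falls below the threshold,
    # i.e. the posterior is low for all categories.
    if posteriors[best_category] < threshold:
        return None
    return best_category
```

With a threshold of 0.5, a document whose best posterior is 0.7 would be assigned to that category, while a document whose posteriors are all below 0.5 would be rejected.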
“…More precisely, this results from a tight integration of the system WISDOM++, which performs document understanding on the basis of geometrical information, with the content-based classification capabilities provided by the system WebClassII [4]. WebClassII is a client-server application that performs the automated classification of Web pages on the basis of their textual content.…”
Abstract. In this paper we present an integrated approach for semantic structure extraction in document images. Document images are initially processed to extract both their layout and logical structures on the basis of geometrical and spatial information. Then, the textual content of logical components is employed for automatic semantic labeling of layout structures. To support the whole process, different machine learning techniques are applied. Experimental results on a set of biomedical multi-page documents are discussed and future directions are outlined.