Abstract. In this paper we present an integrated approach for semantic structure extraction in document images. Document images are initially processed to extract both their layout and logical structures on the basis of geometrical and spatial information. Then, the textual content of logical components is employed for automatic semantic labeling of layout structures. To support the whole process, different machine learning techniques are applied. Experimental results on a set of biomedical multi-page documents are discussed and future directions are drawn.
Abstract. In this paper, the problem of classifying HTML documents into a hierarchy of categories is investigated in the context of a cooperative information repository, named WebClassII. The hierarchy of categories is involved in all aspects of automated document classification, namely feature extraction, learning, and classification of a new document. Innovative aspects of this work are: a) an experimental study on actual Web documents which can be associated with any node in the hierarchy; b) the feature selection process; c) the automated selection of thresholds for the score returned by a classifier; d) the comparison of three different techniques (flat, hierarchical with proper training sets, hierarchical with hierarchical training sets); e) the definition of new measures for the evaluation of system performance. Results show that the use of hierarchical training sets improves the hierarchical techniques.
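As an illustration of point c), automated threshold selection for a classifier score can be sketched as follows. The abstract does not say how WebClassII chooses its thresholds, so this is only a minimal sketch under an assumed criterion: pick the score value that maximizes F1 on held-out validation data. The function name `select_threshold` and the example scores and labels are hypothetical.

```python
def select_threshold(scores, labels):
    """Return (threshold, f1) maximizing F1 when predicting positive for score >= threshold.

    Assumption: F1 on a validation set is the selection criterion; the
    original system may use a different one.
    """
    best_t, best_f1 = None, -1.0
    # Only observed score values need to be tried as candidate thresholds.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:  # strict comparison keeps the lowest threshold on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical validation scores for one category and their true labels.
val_scores = [0.1, 0.4, 0.35, 0.8, 0.7]
val_labels = [False, False, True, True, True]
threshold, f1 = select_threshold(val_scores, val_labels)
```

In a hierarchical setting, one such threshold could be selected per category, so that a document is passed down to a child category only when its score clears that child's threshold and otherwise stays at the current node.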
Finding disease relationships requires laborious examination of hundreds of possible candidate heterogeneous factors. Much of the related information is currently contained in biological and medical journals, making biomedical text mining a central bioinformatics problem. More than 14 million abstracts of such papers are contained in the Medline collection and are available online. In this paper we present a data mining engine, namely MeSH Terms Associator (MTA), that has been employed in a distributed architecture to refine a generic PubMed query by discovering concept relations in the form of association rules. However, the number of discovered association rules is usually high, and most of them are of little interest to the user. In addition, the presentation of thousands of rules can discourage users from interpreting them. To overcome this problem, we investigate the application of some filtering techniques. Experimental results on datasets corresponding to real-world biomedical queries are discussed and future directions are drawn.
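The idea of discovering association rules among MeSH terms and then filtering them can be sketched as below. The abstract does not name the filtering techniques used with MTA, so this sketch swaps in three standard measures (minimum support, minimum confidence, and lift above 1) and restricts itself to rules between single terms. The function `mine_pair_rules`, its default cut-offs, and the toy documents are all hypothetical.

```python
from itertools import combinations


def mine_pair_rules(transactions, min_support=0.4, min_confidence=0.7, min_lift=1.0):
    """Mine rules X -> Y between single terms and keep only the interesting ones.

    Assumption: support, confidence, and lift stand in for the unspecified
    filtering techniques of the original work.
    """
    n = len(transactions)
    item_count, pair_count = {}, {}
    for t in transactions:
        items = sorted(set(t))
        for i in items:
            item_count[i] = item_count.get(i, 0) + 1
        for a, b in combinations(items, 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1

    rules = []
    for (a, b), c in pair_count.items():
        support = c / n
        if support < min_support:
            continue  # pair is too rare to be worth reporting
        for x, y in ((a, b), (b, a)):
            confidence = c / item_count[x]
            lift = confidence / (item_count[y] / n)
            # Lift <= 1 means Y is no more likely given X than overall.
            if confidence >= min_confidence and lift > min_lift:
                rules.append((x, y, support, confidence, lift))
    # Highest lift first; names break ties deterministically.
    return sorted(rules, key=lambda r: (-r[4], r[0], r[1]))


# Hypothetical abstracts, each represented by its set of MeSH terms.
docs = [
    {"Asthma", "Child", "Humans"},
    {"Asthma", "Child", "Humans"},
    {"Asthma", "Humans"},
    {"Diabetes Mellitus", "Humans"},
    {"Asthma", "Child", "Humans"},
]
rules = mine_pair_rules(docs)
```

On this toy data, rules involving "Humans" are discarded by the lift filter because the term occurs in every document, which shows how such a filter can prune rules that are frequent but uninformative.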