Abstract. This paper describes a new method for the classification of an HTML document into a hierarchy of categories. The hierarchy of categories is involved in all phases of automated document classification, namely feature extraction, learning, and classification of a new document. The innovative aspects of this work are the feature selection process, the automated threshold determination for classification scores, and an experimental study on real-world Web documents that can be associated with any node in the…
“…All these results, which extend those reported in a previous work (Ceci & Malerba, 2003), are obtained by extensive experimentation on three datasets with category hierarchies of different complexity.…”
supporting
confidence: 86%
“…This paper extends and revises the work by Ceci and Malerba (2003) on hierarchical text classification. The main extensions are: 1) the consideration of hierarchical feature sets in feature selection; 2) the improvement of the naïve Bayes algorithm to avoid problems related to the different document length (Kim, Rim, Yook, & Lim, 2002); 3) the validation of the proposed framework also for a probabilistic SVM-based classifier; 4) a new automated threshold selection algorithm that operates according to a bottom-up strategy, thus taking full advantage of the decisions made at lower levels of the hierarchy; 5) a more extensive experimentation.…”
Section: Introduction
supporting
confidence: 65%
“…In a previous work (Ceci & Malerba, 2003) a top-down approach was proposed, which suffered from two limitations:…”
“…1), which include documents of a category (positive examples) and documents of the sibling categories (negative examples), are not considered for two reasons. First, in Ceci and Malerba (2003) we have already showed that hierarchical training sets perform better than proper training sets. Second, when no training document is associated to internal categories, as in the case of some datasets considered in this work, proper training sets cannot be used, since it would be impossible to build a classifier.…”
“…A wide range of statistical and machine learning techniques have been applied to text categorization [3][4][5][6][7][8][9]. However, these techniques are all based on having some initial labeled examples, which are used to train a (semi)-supervised model.…”
Abstract. Clinical Practice Guidelines (CPGs) are increasingly common in clinical medicine for prescribing a set of rules that a physician should follow. Recent interest has focused on accurate retrieval of CPGs at the point of care. Examples are the CPG digital libraries National Guideline Clearinghouse (NGC) and Vaidurya, which are organized along predefined concept hierarchies. In this case, both browsing and concept-based search can be applied. However, a mandatory step in enabling both approaches to CPG retrieval is the manual classification of CPGs along the concept hierarchy, which is extremely time consuming. Supervised learning approaches are usually not satisfactory, since commonly too few or no CPGs are provided as a training set for each class. In this paper we apply TaxSOM for multiple classification. TaxSOM is an unsupervised model that supports the physician in the classification of CPGs along the concept hierarchy, even when no labeled examples are available. This model exploits lexical and topological information on the hierarchy to elaborate a classification hypothesis for any given CPG. We argue that this kind of unsupervised classification can support a physician in classifying CPGs by recommending the most probable classes. An experimental evaluation on various concept hierarchies with hundreds of CPGs and categories provides empirical evidence for the proposed technique.
Abstract. The aim of the study was to determine how significance indicators assigned to different Web page elements (internal metadata, title, headings, and main text) influence automated classification. The data collection that was used comprised 1000 Web pages in engineering, to which Engineering Information classes had been manually assigned. The significance indicators were derived using several different methods: (total and partial) precision and recall, semantic distance and multiple regression. It was shown that for best results all the elements have to be included in the classification process. The exact way of combining the significance indicators turned out not to be overly important: using the F1 measure, the best combination of significance indicators yielded no more than 3% higher performance results than the baseline.