Hierarchical Classification of HTML Documents with WebClassII

Ceci, Michelangelo; Malerba, Donato

doi:10.1007/3-540-36618-0_5

Cited by 26 publications

(16 citation statements)

References 6 publications

“…All these results, which extend those reported in a previous work (Ceci & Malerba, 2003), are obtained by extensive experimentation on three datasets with category hierarchies of different complexity.…”

supporting

confidence: 86%

“…This paper extends and revises the work by Ceci and Malerba (2003) on hierarchical text classification. The main extensions are: 1) the consideration of hierarchical feature sets in feature selection; 2) the improvement of the naïve Bayes algorithm to avoid problems related to the different document length (Kim, Rim, Yook, & Lim, 2002); 3) the validation of the proposed framework also for a probabilistic SVM-based classifier; 4) a new automated threshold selection algorithm that operates according to a bottom-up strategy, thus taking full advantage of the decisions made at lower levels of the hierarchy; 5) a more extensive experimentation.…”

Section: Introduction

supporting

confidence: 65%

“…In a previous work (Ceci & Malerba, 2003) a top-down approach was proposed, which suffered from two limitations:…”

Section: Automated Threshold Determination

mentioning

confidence: 99%

“…1), which include documents of a category (positive examples) and documents of the sibling categories (negative examples), are not considered for two reasons. First, in Ceci and Malerba (2003) we have already showed that hierarchical training sets perform better than proper training sets. Second, when no training document is associated to internal categories, as in the case of some datasets considered in this work, proper training sets cannot be used, since it would be impossible to build a classifier.…”

Section: Introduction

mentioning

confidence: 99%

Classifying web documents in a hierarchy of categories: a comprehensive study

Ceci

Malerba

2007

J Intell Inf Syst

Self Cite

“…A wide range of statistical and machine learning techniques have been applied to text categorization [3][4][5][6][7][8][9]. However, these techniques are all based on having some initial labeled examples, which are used to train a (semi)-supervised model.…”

Section: Task Definition

mentioning

confidence: 99%

Helping Physicians to Organize Guidelines Within Conceptual Hierarchies

Sona

Avesani

Moskovitch

2005

Artificial Intelligence in Medicine

Abstract. Clinical Practice Guidelines (CPGs) are increasingly common in clinical medicine for prescribing a set of rules that a physician should follow. Recent interest is in accurate retrieval of CPGs at the point of care. Examples are the CPGs digital libraries National Guideline Clearinghouse (NGC) or Vaidurya, which are organized along predefined concept hierarchies. In this case, both browsing and concept-based search can be applied. However, mandatory step in enabling both ways to CPGs retrieval is manual classification of CPGs along the concepts hierarchy, which is extremely time consuming. Supervised learning approaches are usually not satisfying, since commonly too few or no CPGs are provided as training set for each class. In this paper we apply TaxSOM for multiple classification. TaxSOM is an unsupervised model that supports the physician in the classification of CPGs along the concepts hierarchy, even when no labeled examples are available. This model exploits lexical and topological information on the hierarchy to elaborate a classification hypothesis for any given CPG. We argue that such a kind of unsupervised classification can support a physician to classify CPGs by recommending the most probable classes. An experimental evaluation on various concept hierarchies with hundreds of CPGs and categories provides the empirical evidence of the proposed technique.

Importance of HTML Structural Elements and Metadata in Automated Subject Classification

Golub

Ardö

2005

Research and Advanced Technology for Digital Libraries

Abstract. The aim of the study was to determine how significance indicators assigned to different Web page elements (internal metadata, title, headings, and main text) influence automated classification. The data collection that was used comprised 1000 Web pages in engineering, to which Engineering Information classes had been manually assigned. The significance indicators were derived using several different methods: (total and partial) precision and recall, semantic distance and multiple regression. It was shown that for best results all the elements have to be included in the classification process. The exact way of combining the significance indicators turned out not to be overly important: using the F1 measure, the best combination of significance indicators yielded no more than 3% higher performance results than the baseline.
