2003
DOI: 10.1007/3-540-36618-0_5

Hierarchical Classification of HTML Documents with WebClassII

Abstract: This paper describes a new method for the classification of an HTML document into a hierarchy of categories. The hierarchy of categories is involved in all phases of automated document classification, namely feature extraction, learning, and classification of a new document. The innovative aspects of this work are the feature selection process, the automated threshold determination for classification scores, and an experimental study on real-world Web documents that can be associated to any node in the…

citations
Cited by 26 publications
(16 citation statements)
references
References 6 publications
“…All these results, which extend those reported in a previous work (Ceci & Malerba, 2003), are obtained by extensive experimentation on three datasets with category hierarchies of different complexity.…”
supporting
confidence: 86%
“…This paper extends and revises the work by Ceci and Malerba (2003) on hierarchical text classification. The main extensions are: 1) the consideration of hierarchical feature sets in feature selection; 2) the improvement of the naïve Bayes algorithm to avoid problems related to the different document length (Kim, Rim, Yook, & Lim, 2002); 3) the validation of the proposed framework also for a probabilistic SVM-based classifier; 4) a new automated threshold selection algorithm that operates according to a bottom-up strategy, thus taking full advantage of the decisions made at lower levels of the hierarchy; 5) a more extensive experimentation.…”
Section: Introduction
supporting
confidence: 65%
“…A wide range of statistical and machine learning techniques have been applied to text categorization [3][4][5][6][7][8][9]. However, these techniques are all based on having some initial labeled examples, which are used to train a (semi)-supervised model.…”
Section: Task Definition
mentioning
confidence: 99%