Classifying web documents in a hierarchy of categories: a comprehensive study

Ceci, Michelangelo; Malerba, Donato

doi:10.1007/s10844-006-0003-2

Cited by 92 publications

(59 citation statements)

References 37 publications


“…In clustering sample sets, we add or delete sample A or B, and for fuzzy data sample set A, B, the convergence control function of fuzzy data closed loop operation and maintenance management in massive information is obtained under the control of attenuation constant $T_1$ and $T_2$:…”

Section: Big Data Classification and High Dimensional Information Reo (mentioning)

confidence: 99%


Big Data Analysis and Simulation of Distributed Marine Green Energy Resources Grid-Connected System

Tian, Huang 2017

Polish Maritime Research


To improve the working stability of a distributed marine green energy resources grid-connected system, the system's big data must be mined, fused, integrated, and recognized using big data analysis methods, which in turn improves the output performance of the grid-connected system. This paper proposes a big data analysis method for such systems based on closed-loop information fusion and autocorrelation characteristic information mining. The method realizes big data closed-loop operation and maintenance management of the grid-connected system, builds a big data information collection model for the system, reconstructs the feature space of the collected big data, constructs the characteristic equation of fuzzy data closed-loop operation and maintenance management in convex spaces, and uses an adaptive feature fusion method to mine the autocorrelation characteristics of the operation and maintenance information. Together, these steps improve the information scheduling and information mining ability of the system. Simulation results show that applying this method, together with multidimensional big data analysis technology, improves the information scheduling and information mining ability of the distributed marine green energy resources grid-connected system, achieves optimized information scheduling, and improves the output performance of the grid-connected system.



“…Marine green energy resources are the renewable natural energy resources contained in the oceans, which is renewable and inexhaustible in the era of existence of solar system [1].…”

Section: Introduction (mentioning)

confidence: 99%


“…Optimization is made feasible by utilizing decomposition of the original problem and making incremental conditional gradient search in the subproblems. Ceci & Malerba (2007) present a comprehensive study on hierarchical classification of Web documents. They extend a previous work (Ceci & Malerba, 2003) considering (i) hierarchical feature selection mechanisms, (ii) a naïve Bayes algorithm aimed at avoiding problems related to different document lengths, (iii) the validation of their framework for a probabilistic SVM-based classifier, and (iv) an automated threshold selection algorithm.…”

Section: Hierarchical Text Categorization (mentioning)

confidence: 99%

Retrieving and Categorizing Bioinformatics Publications through a MultiAgent System

Addis, Armano, Vargiu et al. 2011

Computational Biology and Applied Bioinformatics


In this chapter, we present PUB.MAS, a multiagent system able to retrieve and categorize bioinformatics publications from selected Web sources. The chapter extends and revises our previous work (Armano et al., 2007). The main extensions consist of a more detailed presentation of the information extraction task, a deeper explanation of the adopted hierarchical text categorization technique, and the description of the prototype that has been implemented. Built upon X.MAS (Addis et al., 2008), a generic multiagent architecture aimed at retrieving, filtering and reorganizing information according to user interests, PUB.MAS is able to: (i) extract information from online digital archives; (ii) categorize publications according to a given taxonomy; and (iii) process users' feedback. As for information extraction, PUB.MAS provides specific wrappers able to extract publications from RSS-based Web pages and from Web Services. As for categorization, PUB.MAS performs Progressive Filtering (PF), the effective hierarchical text categorization technique described in (Addis et al., 2010). In its simplest setting, PF decomposes a given rooted taxonomy into pipelines, one for each existing path between the root and each node of the taxonomy, so that each pipeline can be tuned in isolation. To this end, a threshold selection algorithm has been devised, aimed at finding a sub-optimal combination of thresholds for each pipeline. PUB.MAS also provides suitable strategies to allow users to express what they are really interested in and to personalize search results accordingly. Moreover, PUB.MAS provides a straightforward approach to user feedback with the goal of improving the performance of the system depending on user needs and preferences. The prototype allows users to set the sources from which publications will be extracted and the topics they are interested in. As for the digital archives, the user can choose between BMC Bioinformatics and PubMed Central. As for the topics of interest, the user can select one or more categories from the adopted taxonomy, which is taken from the TAMBIS ontology (Baker et al., 1999). The overall task begins with agents able to handle the selected digital archives, which extract the candidate publications. Then, all agents that embody a classifier trained on the selected topics are involved to perform text categorization. Finally, the system supplies the user with the selected publications through suitable interface agents. The chapter is organized as follows. First, we give a brief survey of relevant related work on: (i) scientific publication retrieval; (ii) hierarchical text categorization; and (iii) multiagent systems in information retrieval. Subsequently, we concentrate on the task of retrieving and categorizing bioinformatics publications. Then, PUB.MAS is illustrated together with its performance and the implemented prototype. Conclusions end the chapter.
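The pipeline decomposition mentioned in this abstract can be illustrated with a minimal sketch. It is not the PUB.MAS implementation: the toy taxonomy, the function names `build_pipelines` and `passes`, and the choice to leave the root out of each pipeline are all assumptions made for illustration. Each node gets the chain of categories on its path from the root, and a document is accepted by a pipeline only if every classifier along that chain scores at or above its threshold.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy, given as child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "protein": "root",
    "enzyme": "protein",
    "nucleic_acid": "root",
}


def build_pipelines(parent: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Map each non-root node to the list of categories on its path from the root.

    The root itself is left out (an assumption): it is treated as having
    no classifier of its own.
    """
    pipelines: Dict[str, List[str]] = {}
    for node, up in parent.items():
        if up is None:
            continue  # skip the root
        path: List[str] = []
        current: Optional[str] = node
        while current is not None and parent[current] is not None:
            path.append(current)
            current = parent[current]
        path.reverse()  # order from the top of the taxonomy down to the node
        pipelines[node] = path
    return pipelines


def passes(path: List[str], scores: Dict[str, float],
           thresholds: Dict[str, float]) -> bool:
    """Accept a document only if every classifier on the path clears its threshold."""
    return all(scores[category] >= thresholds[category] for category in path)


if __name__ == "__main__":
    for leaf, path in build_pipelines(PARENT).items():
        print(leaf, "<-", " > ".join(path))
```

Because each pipeline carries its own list of thresholds, the thresholds of one path can be adjusted without touching any other path, which is the sense in which a pipeline "can be tuned in isolation".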


“…Some other researchers (Ceci and Malerba 2007; Sun and Lim 2001) have proposed that evaluation measures specific to the hierarchical case should be used in HTC, so that credit is given to "partially correct" classification, i.e., to the misclassification of a document into a category topologically close to the correct one. We think that these measures are difficult to judge in the abstract, since whether a user would gain any more benefit from a "partially correct" classification than from a "completely wrong" classification remains open to question, and fundamentally dependent on the particular application.…”

Section: Related Work (mentioning)

confidence: 99%
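To make the idea of credit for "partially correct" classification concrete, here is a minimal sketch of one possible distance-based score. It is not the measure defined by Ceci and Malerba or by Sun and Lim; the toy taxonomy, the function names, and the linear decay controlled by `max_distance` are assumptions chosen only for illustration. The score is 1 for an exact match and falls as the predicted category moves further from the true one in the tree.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy, given as child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "protein": "root",
    "enzyme": "protein",
    "nucleic_acid": "root",
}


def ancestors(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain node, parent(node), ..., root."""
    chain = [node]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain


def tree_distance(a: str, b: str, parent: Dict[str, Optional[str]]) -> int:
    """Number of edges on the path between a and b through their lowest common ancestor."""
    steps_from_b = {n: i for i, n in enumerate(ancestors(b, parent))}
    for steps_from_a, n in enumerate(ancestors(a, parent)):
        if n in steps_from_b:
            return steps_from_a + steps_from_b[n]
    raise ValueError("nodes do not share a root")


def partial_credit(predicted: str, actual: str,
                   parent: Dict[str, Optional[str]], max_distance: int) -> float:
    """Linear decay (an assumption): 1.0 for an exact match, 0.0 at max_distance or beyond."""
    distance = tree_distance(predicted, actual, parent)
    return max(0.0, 1.0 - distance / max_distance)
```

Whether such a score is useful depends on the application, which is exactly the reservation raised in the citation statement above: a flat measure corresponds to giving zero credit for any nonzero distance.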

Boosting multi-label hierarchical text categorization

2008


Hierarchical Text Categorization (HTC) is the task of generating (usually by means of supervised learning algorithms) text classifiers that operate on hierarchically structured classification schemes. Notwithstanding the fact that most large-sized classification schemes for text have a hierarchical structure, so far the attention of text classification researchers has mostly focused on algorithms for "flat" classification, i.e. algorithms that operate on non-hierarchical classification schemes. These algorithms, once applied to a hierarchical classification problem, are not capable of taking advantage of the information inherent in the class hierarchy, and may thus be suboptimal, in terms of efficiency and/or effectiveness. In this paper we propose TREEBOOST.MH, a multi-label HTC algorithm consisting of a hierarchical variant of ADABOOST.MH, a very well-known member of the family of "boosting" learning algorithms. TREEBOOST.MH embodies several intuitions that had arisen before within HTC: e.g. the intuitions that both feature selection and the selection of negative training examples should be performed "locally", i.e. by paying attention to the topology of the classification scheme. It also embodies the novel intuition that the weight distribution that boosting algorithms update at every boosting round should likewise be updated "locally". All these intuitions are embodied within TREEBOOST.MH in an elegant and simple way, i.e. by defining TREEBOOST.MH as a recursive algorithm that uses ADABOOST.MH as its base step, and that recurs over the tree structure. We present the results of experiments with TREEBOOST.MH on three HTC benchmarks, and discuss analytically its computational cost.

