In many application contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well-established classification problems, such as single-label and/or multi-label multiclass classification, using state-of-the-art learning technology such as SVMs and kernel-based methods. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form "for document d_i, category c' is preferred to category c''"; this allows us to distinguish between primary and secondary categories not only in the
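
The preferential form mentioned in this abstract ("for document d_i, category c' is preferred to category c''") can be illustrated with a minimal sketch. The function name `preference_pairs` and the ordering used here (primary preferred to secondary, and both preferred to categories not assigned to the document) are assumptions made for illustration; the abstract does not spell out how the preferences are generated.

```python
from itertools import product


def preference_pairs(primary, secondary, all_categories):
    """Return (preferred, less_preferred) category pairs for one document.

    Assumed ordering (hypothetical, for illustration only):
    primary > secondary > categories not assigned to the document.
    """
    primary = set(primary)
    secondary = set(secondary)
    others = set(all_categories) - primary - secondary

    pairs = set()
    pairs.update(product(primary, secondary))  # primary preferred to secondary
    pairs.update(product(primary, others))     # primary preferred to unassigned
    pairs.update(product(secondary, others))   # secondary preferred to unassigned
    return sorted(pairs)


# One document with primary category "A", secondary category "B",
# drawn from a hypothetical category set {"A", "B", "C"}.
pairs = preference_pairs(primary=["A"], secondary=["B"], all_categories=["A", "B", "C"])
print(pairs)
```

Each pair reads as "the first category is preferred to the second for this document", which is the kind of input a preference-learning algorithm would consume instead of flat class labels.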
In recent years, more and more attention has been paid to
learning in structured domains, e.g. Chemistry. Both Neural Networks
and Kernel Methods for structured data have been proposed. Here, we
show that a recently developed technique for structured domains, namely
PCA for structures, makes it possible to generate representations of
graphs (specifically, molecular graphs) that are quite effective when
used for prediction tasks (QSAR studies). The advantage of these
representations is that they can be generated automatically and
exploited by any traditional predictor (e.g., Support Vector Regression
with a linear kernel).