Automatically categorizing documents into pre-defined topic hierarchies or taxonomies is a crucial step in knowledge and content management. Standard machine learning techniques like Support Vector Machines and related large margin methods have been successfully applied to this task, even though they ignore the inter-class relationships. In this paper, we propose a novel hierarchical classification method that generalizes Support Vector Machine learning and is based on discriminant functions structured to mirror the class hierarchy. Our method can work with arbitrary, not necessarily singly connected taxonomies and can deal with task-specific loss functions. All parameters are learned jointly by optimizing a common objective function corresponding to a regularized upper bound on the empirical loss. We present experimental results on the WIPO-alpha patent collection to show the competitiveness of our approach.
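As a rough illustration of a discriminant function that mirrors the class hierarchy, the sketch below scores a leaf class by summing per-node linear scores along its path to the root, and uses a path-based loss as one example of a task-specific loss. This is a minimal sketch under stated assumptions, not the paper's implementation: the toy taxonomy `PARENT`, the hand-set weights, and the function names are all hypothetical, the taxonomy is restricted to a tree for brevity, and no learning step is shown.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (a tree, for brevity): child -> parent, None marks the root.
PARENT: Dict[str, Optional[str]] = {
    "root": None, "A": "root", "B": "root", "A1": "A", "A2": "A", "B1": "B",
}

def path_nodes(label: str) -> List[str]:
    """Return the label and its ancestors, excluding the root."""
    nodes = []
    node: Optional[str] = label
    while node is not None and PARENT[node] is not None:
        nodes.append(node)
        node = PARENT[node]
    return nodes

def score(weights: Dict[str, List[float]], x: List[float], label: str) -> float:
    """Sum the linear scores <w_node, x> over every node on the label's path."""
    return sum(
        sum(w * xi for w, xi in zip(weights[node], x))
        for node in path_nodes(label)
    )

def predict(weights: Dict[str, List[float]], x: List[float], leaves: List[str]) -> str:
    """Pick the leaf class with the highest hierarchical score."""
    return max(leaves, key=lambda y: score(weights, x, y))

def tree_loss(y_true: str, y_pred: str) -> int:
    """Assumed example loss: number of path nodes not shared by the two labels."""
    return len(set(path_nodes(y_true)) ^ set(path_nodes(y_pred)))

# Hand-set weights for illustration only (a real model would learn these jointly).
weights = {"A": [1.0, 0.0], "B": [0.0, 1.0],
           "A1": [0.5, 0.0], "A2": [-0.5, 0.0], "B1": [0.0, 0.0]}
print(predict(weights, [1.0, 0.0], ["A1", "A2", "B1"]))
print(tree_loss("A1", "A2"), tree_loss("A1", "B1"))
```

Because sibling classes share the weight vectors of their common ancestors, confusing `A1` with `A2` costs less under this loss than confusing `A1` with `B1`, which is the kind of inter-class relationship a flat classifier ignores.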
The capsular polysaccharide (CPS) synthesis locus of 13 Streptococcus suis serotypes (serotypes 1, 3, 4, 5, 7, 8, 9, 10, 14, 19, 23, 25 and 1/2) was sequenced and compared with that of serotypes 2 and 16. The CPS synthesis loci of these 15 serotypes fall into two genetic groups. The locus is located on the chromosome between orfZ and aroA. All the translated proteins in the CPS synthesis locus were clustered into 127 homology groups using the TribeMCL algorithm. The general organization of the locus suggested that the CPS of S. suis could be synthesized by the Wzy-dependent pathway. The capsule of serotypes 3, 4, 5, 7, 9, 10, 19 and 23 was predicted to be amino-polysaccharide. Sialic acid was predicted to be present in the capsule of serotypes 1, 2, 14, 16 and 1/2. The characteristics of the CPS synthesis locus suggest that some genes may have been imported into S. suis (or their ancestors) on multiple occasions from different and unknown sources.
Term-based representations of documents have found widespread use in information retrieval. However, one of the main shortcomings of such methods is that they largely disregard lexical semantics and, as a consequence, are not sufficiently robust with respect to variations in word usage. In this paper we investigate the use of concept-based document representations to supplement word- or phrase-based features. The concepts are automatically extracted from documents via probabilistic latent semantic analysis. We propose to use AdaBoost to optimally combine weak hypotheses based on both types of features. Experimental results on standard benchmarks confirm the validity of our approach, showing that AdaBoost achieves consistent improvements by including additional semantic features in the learned ensemble.
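The idea of boosting over both feature types can be sketched as follows: term features and concept features are concatenated, and binary AdaBoost picks one decision stump per round from either group. This is a minimal sketch under stated assumptions, not the paper's setup: the weak learners are assumed to be single-feature threshold stumps, the task is assumed binary with labels in {+1, -1}, and the term counts and concept values are hand-set hypothetical numbers rather than output of probabilistic latent semantic analysis.

```python
import math

def stump_output(row, j, thr, pol):
    """Weak hypothesis: +pol if feature j exceeds thr, otherwise -pol."""
    return pol if row[j] > thr else -pol

def best_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_output(row, j, thr, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, rounds=3):
    """Binary AdaBoost: reweight examples and collect weighted stumps."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, thr, pol = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, thr, pol))
        w = [wi * math.exp(-alpha * yi * stump_output(row, j, thr, pol))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    vote = sum(a * stump_output(row, j, thr, pol) for a, j, thr, pol in ensemble)
    return 1 if vote >= 0 else -1

# Hypothetical data: counts of two synonymous terms, plus one concept feature.
terms = [[2, 0], [0, 3], [0, 0], [0, 0]]
concepts = [[0.9], [0.8], [0.1], [0.2]]
X = [t + c for t, c in zip(terms, concepts)]  # concatenated representation
y = [1, 1, -1, -1]

ensemble = adaboost(X, y)
print([predict(ensemble, row) for row in X])
```

In this toy data the two positive documents use different words for the same topic, so no single term stump separates the classes, while a stump on the concept feature does. That is meant only to illustrate why semantic features could help with variation in word usage.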