Support vector machines classification with a very large-scale taxonomy

Liu, Tie-Yan; Yang, Yiming; Wan, Hao; Zeng, Hua-Jun; Chen, Zheng; Ma, Wei‐Ying

doi:10.1145/1089815.1089821

Cited by 181 publications

(181 citation statements)

References 16 publications

Supporting

Mentioning

178

Contrasting

Unclassified

“…The true Yahoo! directory structure contains thousands of labels and is a very difficult classification problem that traditional classification methods fail to adequately handle (Liu et al 2005). However, the majority of multi-label research conducted using the Yahoo!…”

Section: Background and Motivation (mentioning)

confidence: 99%

“…In contrast to the datasets typically utilized in research, multilabel corpora in the real world can contain thousands or tens of thousands of labels, and the label frequencies in these datasets tend to have highly skewed frequency-distributions with power-law statistics (Yang et al 2003;Liu et al 2005;Dekel and Shamir 2010). Figure 1 illustrates this point for three large real-world corpora-each containing thousands of unique labels-by plotting the number of labels within each corpus as a function of label-frequency.…”

Section: Background and Motivation (mentioning)

confidence: 99%

Statistical topic models for multi-label document classification

et al. 2011

Machine learning approaches to multi-label document classification have to date largely relied on discriminative modeling techniques such as support vector machines. A drawback of these approaches is that performance rapidly drops off as the total number of labels and the number of labels per document increase. This problem is amplified when the label frequencies exhibit the type of highly skewed distributions that are often observed in real-world datasets. In this paper we investigate a class of generative statistical topic models for multi-label documents that associate individual word tokens with different labels. We investigate the advantages of this approach relative to discriminative models, particularly with respect to classification problems involving large numbers of relatively rare labels. We compare the performance of generative and discriminative approaches on document labeling tasks ranging from datasets with several thousand labels to datasets with tens of labels. The experimental results indicate that probabilistic generative models can achieve competitive multi-label classification performance compared to discriminative methods, and have advantages for datasets with many labels and skewed label frequencies.

“…However, the work by [6] is among the pioneering in hierarchical classification towards addressing Web-scale directories such as Yahoo! directory consisting of over 100,000 target classes.…”

Section: Other Related Work (mentioning)

confidence: 99%

“…However, these approaches lead to multiple folds increase in training time as shown in [9]. Prediction speed also suffers by employing excessive flattening as studied in the work by [6] showing that the space complexity of a flat classifier is much higher than a hierarchical model. Moreover, for predicting an unseen test instance in a K class problem, one needs to evaluate O(K) classifiers in flat classification as against O(log K) classifiers in a top-down manner.…”

Section: Problem Setup (mentioning)

confidence: 99%

Maximum-Margin Framework for Training Data Synchronization in Large-Scale Hierarchical Classification

Babbar, Partalas, Gaussier, et al. 2013

Neural Information Processing

Abstract. In the context of supervised learning, the training data for large-scale hierarchical classification consist of (i) a set of input-output pairs, and (ii) a hierarchy structure defining parent-child relations among class labels. It is often the case that the hierarchy structure given a priori is not optimal for achieving high classification accuracy. This is especially true for large web taxonomies such as the Yahoo! directory, which consist of tens of thousands of classes, particularly because an important goal of hierarchy design is to render better navigability and browsing. In this work, we propose a maximum-margin framework for automatically adapting the given hierarchy based on the set of input-output pairs to yield a new hierarchy. The proposed method is not only theoretically justified but also provides a more principled approach than the hierarchy flattening techniques proposed earlier, which are ad hoc and empirical in nature. The empirical results on large-scale public datasets demonstrate that classification with the new hierarchy leads to better or comparable generalization performance than the hierarchy flattening techniques. Moreover, since the proposed method largely maintains the overall hierarchical structure, it leads to faster prediction and lower space complexity.

“…The current best practice on link suggestion is prefix matching over titles of Wikipedia articles, and existing document classification approaches are not proper for the category suggestion task due to their poor effectiveness and efficiency when dealing with large-scale category systems [27].…”

Section: Introduction (mentioning)

confidence: 99%

Making More Wikipedians: Facilitating Semantics Reuse for Wikipedia Authoring

Wang, Zhu, et al. 2007

The Semantic Web

Abstract. Wikipedia, a killer application in Web 2.0, has embraced the power of collaborative editing to harness collective intelligence. It can also serve as an ideal Semantic Web data source due to its abundance, influence, high quality, and good structure. However, the heavy burden of building up and maintaining such an enormous and ever-growing online encyclopedic knowledge base still rests on a very small group of people. Many casual users may still find it difficult to write high-quality Wikipedia articles. In this paper, we use RDF graphs to model the key elements in Wikipedia authoring, and propose an integrated solution to make Wikipedia authoring easier based on RDF graph matching, with the aim of making more Wikipedians. Our solution facilitates semantics reuse and provides users with: 1) a link suggestion module that suggests and auto-completes internal links between Wikipedia articles for the user; 2) a category suggestion module that helps the user place her articles in correct categories. A prototype system is implemented and experimental results show significant improvements over existing solutions to link and category suggestion tasks. The proposed enhancements can be applied to attract more contributors and relieve the burden of professional editors, thus enhancing the current Wikipedia to make it an even better Semantic Web data source.
