2011
DOI: 10.1109/tkde.2010.158

When Does Cotraining Work in Real Data?

Abstract: Co-training, a paradigm of semi-supervised learning, promises to effectively alleviate the shortage of labeled examples in supervised learning. The standard two-view co-training requires the dataset to be described by two views of features, and previous studies have shown that co-training works well if the two views satisfy the sufficiency and independence assumptions. In practice, however, these two assumptions are often not known or ensured (even when the two views are given). More commonly, most…
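The abstract refers to the standard two-view co-training loop. As a point of reference, below is a minimal sketch of that loop in Python, assuming scikit-learn's GaussianNB as the base learner; the function name, the confidence-based example selection, and the add_per_round parameter are illustrative choices, not the paper's implementation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def cotrain(X1, X2, y, labeled_idx, unlabeled_idx, rounds=10, add_per_round=5):
    """Standard two-view co-training sketch: in each round, each view's
    classifier labels the unlabeled examples it is most confident about,
    and those examples are added to the shared labeled pool."""
    labeled, unlabeled = list(labeled_idx), list(unlabeled_idx)
    y = y.copy()
    clf1, clf2 = GaussianNB(), GaussianNB()
    for _ in range(rounds):
        if not unlabeled:
            break
        clf1.fit(X1[labeled], y[labeled])
        clf2.fit(X2[labeled], y[labeled])
        for clf, X in ((clf1, X1), (clf2, X2)):
            if not unlabeled:
                break
            proba = clf.predict_proba(X[unlabeled])
            # positions (within the unlabeled pool) of the most confident predictions
            picks = np.argsort(proba.max(axis=1))[::-1][:add_per_round]
            for p in sorted(picks, reverse=True):  # reverse so deletions keep positions valid
                idx = unlabeled[p]
                y[idx] = clf.predict(X[idx:idx + 1])[0]
                labeled.append(idx)
                del unlabeled[p]
    clf1.fit(X1[labeled], y[labeled])
    clf2.fit(X2[labeled], y[labeled])
    return clf1, clf2
```

Whether this procedure helps depends on the quality of the two views X1 and X2, which is exactly the question the paper examines.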


Cited by 76 publications (40 citation statements)
References 13 publications
“…Nevertheless, this method has become an example for recent models thanks to the idea of using the agreement (or disagreement) of multiple classifiers and the mutual teaching approach. A good study of when co-training works can be found in [32].…”
Section: B Self-labeled Techniques: Previous Work
confidence: 99%
“…Some promising results have been achieved in this field [3][4][5][6][7], but this proved to be a difficult task, as the relation between the characteristics of the views and the performance of cotraining has not been sufficiently understood. Moreover, research [4] indicates that given a small training dataset as in real-world situations where co-training is called for, the sufficiency and independence assumptions cannot be reliably verified, making the split methods unreliable and application of co-training uncertain.…”
Section: Related Work
confidence: 99%
“…In addition, we performed experiments on 14 binary and 8 multi-class UCI datasets also previously used for evaluating co-training [4,8,13]. The benchmark datasets of various properties were selected to give us a better insight into how effective our method is on datasets of varying dimensionality, size and redundancy.…”
Section: A Datasets and Configuration
confidence: 99%
“…It is evident, however, that a random split would not work in most cases. Du et al. [8] tried several heuristics for view split and found that all heuristics failed with insufficient labeled data. The necessary condition of co-training given in [24] suggested that among all potential view splits, the one which enables the most unlabeled instances to connect with labeled examples in the combinative graph is preferred; this was empirically verified in [24] and might give inspiration to develop sound practical view split approaches.…”
Section: About the Views
confidence: 99%
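The view-split discussion above can be made concrete with the naive baseline it argues against: randomly partitioning the feature columns into two "views". The sketch below assumes NumPy and is illustrative only; the cited results indicate such splits, and the heuristics tried by Du et al., are unreliable when labeled data are scarce.

```python
import numpy as np

def random_view_split(X, rng=None):
    """Randomly partition the feature columns of X into two disjoint 'views'
    for co-training (the naive baseline discussed above)."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(X.shape[1])
    half = X.shape[1] // 2
    return X[:, perm[:half]], X[:, perm[half:]]
```

For example, X1, X2 = random_view_split(X, rng=0) would produce two candidate views to feed the co-training loop sketched earlier, without any guarantee that the sufficiency and independence assumptions hold.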