2010
DOI: 10.1109/tasl.2009.2033421

Active Learning With Sampling by Uncertainty and Density for Data Annotations

Cited by 108 publications (61 citation statements)
References 21 publications
“…And the key to increasing the accuracy of an active machine learning algorithm lies in the selection of highly informative samples [13]. Conventional active learning algorithms form the initial training set by selecting highly representative samples through clustering analysis [14], and then label the most uncertain samples. However, they usually achieve unsatisfying results in interactive information retrieval due to the small initial training set and the existence of outliers [15].…”
Section: State of the Art
confidence: 99%
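The cluster-based seeding this excerpt describes can be sketched in a few lines. A minimal illustration, assuming scikit-learn's KMeans with Euclidean distance; the function name select_initial_set and the parameter n_seeds are hypothetical, not taken from the cited work.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_initial_set(X_unlabeled, n_seeds=10, random_state=0):
    """Pick one representative example per cluster as the initial training set.

    Sketch of cluster-based seeding: partition the unlabeled pool into
    n_seeds clusters, then take the point closest to each centroid so the
    seeds cover distinct regions of the pool.
    """
    km = KMeans(n_clusters=n_seeds, random_state=random_state, n_init=10)
    labels = km.fit_predict(X_unlabeled)
    seed_indices = []
    for k in range(n_seeds):
        members = np.where(labels == k)[0]
        dists = np.linalg.norm(X_unlabeled[members] - km.cluster_centers_[k], axis=1)
        seed_indices.append(members[np.argmin(dists)])
    return np.array(seed_indices)
```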
“…In addition to this confidence-based uncertainty measure, other measures are common as well (Settles 2012), like entropy or the margin between a candidate and the decision boundary. Similar to the issue of the true posterior above, a known drawback (Zhu et al. 2010) of US is that these proxies do not consider the number of similar instances on which the posterior estimates are made or the decision boundaries are drawn. The reported results of empirical evaluations are somewhat inconclusive, with some authors [e.g.…”
Section: Background and Related Work
confidence: 99%
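For concreteness, the three common uncertainty proxies the excerpt refers to (least confidence, margin, and entropy) can be written directly in terms of predicted class probabilities. A sketch, assuming a probs array of shape (n_samples, n_classes) from any probabilistic classifier; the function names are illustrative.

```python
import numpy as np

def least_confidence(probs):
    # Uncertainty = 1 - probability of the most likely class (higher = query first).
    return 1.0 - probs.max(axis=1)

def margin(probs):
    # Difference between the top two class probabilities;
    # examples with the SMALLEST margin are the most uncertain.
    part = np.sort(probs, axis=1)
    return part[:, -1] - part[:, -2]

def entropy(probs, eps=1e-12):
    # Shannon entropy of the predicted class distribution (higher = more uncertain).
    return -np.sum(probs * np.log(probs + eps), axis=1)
```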
“…Xu et al. (2003) proposed a representative sampling method, which first clusters the unlabeled examples located in the margin of an SVM classifier, and then queries the labels of the examples that are close to each cluster centroid. Zhu et al. (2010) presented a K-Nearest-Neighbor-based density measure that quantifies density as the average similarity between an unlabeled example and its K nearest neighbors, weighting the entropy-based uncertainty by this KNN density. McCallum et al. (1998) proposed a density-weighted QBC algorithm, which chooses examples with the highest committee disagreement in predicted labels, weighted by sample density.…”
Section: General Active Learning
confidence: 99%
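The KNN density weighting described in this excerpt is straightforward to sketch. A minimal illustration, assuming cosine similarity as the similarity function (the excerpt does not fix a particular one) and a pool with more than k examples; the names knn_density and sud_scores are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def knn_density(X_pool, k=10):
    """Average similarity of each example to its k nearest neighbors."""
    sim = cosine_similarity(X_pool)
    np.fill_diagonal(sim, -np.inf)       # exclude self-similarity
    topk = np.sort(sim, axis=1)[:, -k:]  # k most similar neighbors (assumes len(X_pool) > k)
    return topk.mean(axis=1)

def sud_scores(probs, X_pool, k=10, eps=1e-12):
    """Entropy-based uncertainty weighted by KNN density, as described above.

    Dense, uncertain examples score high; isolated outliers are down-weighted
    even when the classifier is uncertain about them.
    """
    ent = -np.sum(probs * np.log(probs + eps), axis=1)
    return ent * knn_density(X_pool, k)

# Usage: query the example with the highest density-weighted uncertainty, e.g.
# query_idx = np.argmax(sud_scores(model.predict_proba(X_pool), X_pool))
```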
“…However, the major shortcoming is that they cannot differentiate outliers from informative points, and thus often fail by selecting outliers (Settles 2012). To solve this so-called outlier problem, several density-weighted active learning approaches have been proposed that model the input distribution explicitly during data sampling (Xu et al. 2003; Zhu et al. 2010; McCallum and Nigam 1998; Nguyen and Smeulders 2004). The central idea of using prior data density in active learning is that it considers the whole input space rather than individual data points.…”
confidence: 99%