Statistical Hypothesis Testing in Positive Unlabelled Data

Sechidis, Konstantinos; Calvo, B.; Brown, Gavin

doi:10.1007/978-3-662-44845-8_5

Cited by 10 publications

(11 citation statements)

References 13 publications

“…However, the power of the test differs with a constant correction factor $\frac{1-\alpha}{\alpha}\,\frac{\Pr(s=0)}{1-\Pr(s=0)}$. Because the correction factor is a constant that depends on the amount of labeled data, one can calculate how much more data is required to get the desired power [90]. The conditional test of independence, which was used for learning the PTAN trees, has similar properties [9,88].…”

Section: Hypothesis Testing (mentioning)

confidence: 99%

Learning from positive and unlabeled data: a survey

2020

Learning from positive and unlabeled data or PU learning is the setting where a learner only has access to positive examples and unlabeled data. The assumption is that the unlabeled data can contain both positive and negative examples. This setting has attracted increasing interest within the machine learning literature as this type of data naturally arises in applications such as medical diagnosis and knowledge base completion. This article provides a survey of the current state of the art in PU learning. It proposes seven key research questions that commonly arise in this field and provides a broad overview of how the field has tried to address them.
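
The citation statement above says that testing against the observed label is less powerful than testing against the true class, and that more data can make up the difference. The sketch below illustrates that gap by Monte Carlo simulation rather than through the correction factor itself. Everything in it is an illustrative assumption, not taken from the cited works: a binary feature, a G-test with one degree of freedom, positives labelled with a constant probability `label_freq`, and the function names and parameter values.

```python
import math
import random


def g_test_p_value(xs, zs):
    """p-value of the G-test of independence between two binary sequences (1 d.o.f.)."""
    n = len(xs)
    counts = [[0, 0], [0, 0]]
    for x, z in zip(xs, zs):
        counts[x][z] += 1
    g = 0.0
    for i in range(2):
        for j in range(2):
            obs = counts[i][j]
            if obs > 0:
                expected = sum(counts[i]) * (counts[0][j] + counts[1][j]) / n
                g += 2.0 * obs * math.log(obs / expected)
    # Survival function of a chi-square variable with one degree of freedom.
    return math.erfc(math.sqrt(max(g, 0.0) / 2.0))


def estimate_power(use_surrogate, n=300, trials=200, alpha=0.05,
                   label_freq=0.3, seed=0):
    """Fraction of simulated datasets in which independence is rejected.

    Hypothetical generative model: y ~ Bernoulli(0.5), x depends on y,
    and s = 1 only for positives, each labelled with probability label_freq.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        xs, targets = [], []
        for _ in range(n):
            y = 1 if rng.random() < 0.5 else 0
            x = 1 if rng.random() < (0.6 if y else 0.4) else 0
            s = 1 if (y == 1 and rng.random() < label_freq) else 0
            xs.append(x)
            targets.append(s if use_surrogate else y)
        if g_test_p_value(xs, targets) < alpha:
            rejections += 1
    return rejections / trials


power_true = estimate_power(use_surrogate=False)
power_surrogate = estimate_power(use_surrogate=True)
print(f"power with true labels y:      {power_true:.2f}")
print(f"power with surrogate labels s: {power_surrogate:.2f}")
```

Because both calls use the same seed, the two tests are run on identical simulated datasets, so the difference in rejection rates comes only from replacing y with s. Increasing `n` in the surrogate call shows how additional data recovers the lost power.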

“…3.4 and 5 in Sechidis and Brown (2015), while parts of Sect. 3.3 in Sechidis et al (2014). Those two previous works focused only on feature selection through hypothesis testing.…”

Section: Results On Semi-supervised Feature Ranking (mentioning)

confidence: 99%

“…number of labelled examples) needed, following the same procedure as in sample size determination. In our previous work (Sechidis et al 2014), we presented a complete methodology for sample/labelled size determination in positive-unlabelled scenarios by using the $\kappa_{Y_0}$ correction factor and surrogate $Y_0$.…”

Section: Theorem 4 (Mar-c: Informed Surrogate Approaches) In Mar-c On (mentioning)

confidence: 99%

Simple strategies for semi-supervised feature selection

Sechidis

Brown

2017

Mach Learn

Self Cite

What is the simplest thing you can do to solve a problem? In the context of semi-supervised feature selection, we tackle exactly this: how much we can gain from two simple classifier-independent strategies. If we have some binary labelled data and some unlabelled, we could assume the unlabelled data are all positives, or assume them all negatives. These minimalist, seemingly naive, approaches have not previously been studied in depth. However, with theoretical and empirical studies, we show they provide powerful results for feature selection, via hypothesis testing and feature ranking. Combining them with some "soft" prior knowledge of the domain, we derive two novel algorithms (Semi-JMI, Semi-IAMB) that outperform significantly more complex competing methods, showing particularly good performance when the labels are missing-not-at-random. We conclude that simple approaches to this problem can work surprisingly well, and in many situations we can provably recover the exact feature selection dynamics, as if we had labelled the entire dataset.

“…Building upon this assumption, Sechidis et al [19] proved that we can test independence between a feature X and the unobservable variable Y, by simply testing the independence between X and the observable variable $S_P$, which can be seen as a surrogate version of Y. While this assumption is sufficient for testing independence and guarantees the same probability of false positives, it leads to a less powerful test, and the probability of committing a false negative error is increased by a factor which can be calculated using prior knowledge over $p(y^+)$.…”

Section: Positive-unlabelled Data (mentioning)

confidence: 98%

Markov Blanket Discovery in Positive-Unlabelled and Semi-supervised Data

Sechidis

Brown

2015

Machine Learning and Knowledge Discovery in Databases

Self Cite

Abstract. The importance of Markov blanket discovery algorithms is twofold: as the main building block in constraint-based structure learning of Bayesian network algorithms and as a technique to derive the optimal set of features in filter feature selection approaches. Equally, learning from partially labelled data is a crucial and demanding area of machine learning, and extending techniques from fully to partially supervised scenarios is a challenging problem. While there are many different algorithms to derive the Markov blanket of fully supervised nodes, the partially-labelled problem is far more challenging, and there is a lack of principled approaches in the literature. Our work derives a generalization of the conditional tests of independence for partially labelled binary target variables, which can handle the two main partially labelled scenarios: positive-unlabelled and semi-supervised. The result is a significantly deeper understanding of how to control false negative errors in Markov Blanket discovery procedures and how unlabelled data can help.
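
The surrogate idea in the citation statement above (test X against the observed label S instead of the unobservable class Y) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature is binary, positives are labelled with a constant probability `label_freq`, the test is a G-test on a 2x2 table, and the helper names `simulate_pu`, `contingency` and `g_test_2x2` and all parameter values are hypothetical.

```python
import math
import random


def simulate_pu(n, rng, p_pos=0.5, p_x_pos=0.8, p_x_neg=0.2, label_freq=0.5):
    """Draw n positive-unlabelled examples as (x, s) pairs; the true class y is discarded."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < p_pos else 0
        x = 1 if rng.random() < (p_x_pos if y else p_x_neg) else 0
        # Only positives can be labelled; every other example stays unlabelled (s = 0).
        s = 1 if (y == 1 and rng.random() < label_freq) else 0
        data.append((x, s))
    return data


def contingency(data):
    """2x2 table of counts indexed as table[x][s]."""
    table = [[0, 0], [0, 0]]
    for x, s in data:
        table[x][s] += 1
    return table


def g_test_2x2(table):
    """G-test of independence on a 2x2 table of counts; returns (G, p_value)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [table[0][j] + table[1][j] for j in range(2)]
    g = 0.0
    for i in range(2):
        for j in range(2):
            obs = table[i][j]
            if obs > 0:
                expected = row_tot[i] * col_tot[j] / n
                g += 2.0 * obs * math.log(obs / expected)
    g = max(g, 0.0)
    # Chi-square survival function for one degree of freedom.
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value


rng = random.Random(0)
table = contingency(simulate_pu(5000, rng))
g_stat, p_value = g_test_2x2(table)
print(f"G = {g_stat:.2f}, p-value = {p_value:.3g}")
```

The test never sees y: a small p-value for X against S is taken as evidence that X also depends on Y, which is the surrogate argument the statement describes. The false-negative cost mentioned there shows up as lower power at a fixed sample size.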
