Learning Only from Relevant Keywords and Unlabeled Documents

Charoenphakdee, Nontawat; Lee, Jongyeong; Jin, Yiping; Wanvarie, Dittaya; Sugiyama, Masashi

doi:10.18653/v1/d19-1411

Cited by 12 publications

(23 citation statements)

References 39 publications

“…In this section, we demonstrate how to apply the robustness result of symmetric losses to tackle a weakly-supervised natural language processing task, namely learning only from relevant keywords and unlabeled documents [Charoenphakdee et al, 2019a].…”

Section: A Symmetric Loss Approach To Learning Only From Relevant Abs...

mentioning

confidence: 99%

“…The bottleneck of the method proposed by Jin et al [2017] is lack of flexibility of model choices and optimization algorithms. This makes it difficult to bring many…”

Figure 2: An overview of the framework for learning only from relevant keywords and unlabeled document [Charoenphakdee et al, 2019a]. Blue documents indicate positive documents and red documents denote negative documents in the two sets of documents divided by a pseudo-labeling algorithm.

Section: Introduction

mentioning

confidence: 99%

“…This article also demonstrates how to use a symmetric loss in a real-world problem in the context of natural language processing. We discuss an application of symmetric losses for learning a reliable classifier from only relevant keywords and unlabeled documents [Jin et al, 2017; Charoenphakdee et al, 2019a]. In this problem, we first collect unlabeled documents.…”

mentioning

confidence: 99%

“…Unlike collecting labels for every training document, collecting keywords can be much cheaper and the number of keywords does not necessarily scale with the number of unlabeled training documents [Chang et al, 2008; Song and Roth, 2014; Chen et al, 2015; Li and Yang, 2018; Jin et al, 2017, 2020]. We will discuss how this problem can be formulated into the framework of learning under mutually contaminated noise and how using a symmetric loss can be highly useful for solving this problem [Charoenphakdee et al, 2019a].…”

mentioning

confidence: 99%

A Symmetric Loss Perspective of Reliable Machine Learning

Charoenphakdee, Lee, Sugiyama

2021

Preprint

Self Cite

When minimizing the empirical risk in binary classification, it is common practice to replace the zero-one loss with a surrogate loss to make the learning objective feasible to optimize. Examples of well-known surrogate losses for binary classification include the logistic loss, hinge loss, and sigmoid loss. It is known that the choice of a surrogate loss can strongly influence the performance of the trained classifier, and therefore it should be chosen carefully. Recently, surrogate losses that satisfy a certain symmetric condition (a.k.a. symmetric losses) have demonstrated their usefulness in learning from corrupted labels. In this article, we provide an overview of symmetric losses and their applications. First, we review how a symmetric loss can yield robust classification from corrupted labels in balanced error rate (BER) minimization and area under the receiver operating characteristic curve (AUC) maximization. Then, we demonstrate how the robust AUC maximization method can benefit natural language processing in the problem where we want to learn only from relevant keywords and unlabeled documents. Finally, we conclude this article by discussing future directions, including potential applications of symmetric losses for reliable machine learning and the design of non-symmetric losses that can benefit from the symmetric condition.

Introduction

Modern machine learning methods such as deep learning typically require a large amount of data to achieve desirable performance [Schmidhuber, 2015; Goodfellow et al., 2016]. However, it is often the case that the labeling process is costly and time-consuming. To mitigate this problem, one may consider collecting training labels through crowdsourcing [Dawid and Skene, 1979; Kittur et al., 2008], which is a popular approach and has become more convenient in recent years [
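The abstract above names three surrogate losses and refers to "a certain symmetric condition" without stating it. As a rough illustration only, the sketch below assumes the condition is the one commonly given in the symmetric-loss literature, namely that ℓ(z) + ℓ(−z) equals a constant for every margin z. The function names and the sample margins are hypothetical and are not taken from the cited article.

```python
import math

# Assumption: "symmetric" means loss(z) + loss(-z) is the same constant
# for every margin z. The abstract does not spell the condition out.

def sigmoid_loss(z):
    # Sigmoid loss as a function of the margin z.
    return 1.0 / (1.0 + math.exp(z))

def logistic_loss(z):
    # Logistic loss: log(1 + exp(-z)).
    return math.log1p(math.exp(-z))

def hinge_loss(z):
    # Hinge loss: max(0, 1 - z).
    return max(0.0, 1.0 - z)

def symmetric_sum(loss, z):
    # The quantity that must stay constant for a symmetric loss.
    return loss(z) + loss(-z)

# Hypothetical margins chosen only to probe the condition.
margins = [-3.0, -0.5, 0.0, 0.5, 3.0]
losses = {"sigmoid": sigmoid_loss, "logistic": logistic_loss, "hinge": hinge_loss}
sums = {name: [symmetric_sum(f, z) for z in margins] for name, f in losses.items()}

for name, values in sums.items():
    spread = max(values) - min(values)
    print(f"{name:8s} spread of loss(z) + loss(-z): {spread:.3f}")
```

Under this assumed definition, the sigmoid loss gives a constant sum while the logistic and hinge losses do not, which is consistent with the abstract listing all three as surrogates but treating symmetric losses as a separate class.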

“…GE has been successfully applied on different tasks, such as text categorisation (Druck et al 2008) and language identification in mixed-language documents (King and Abney 2013). Similarly, Charoenphakdee et al (2019) proposed a theoretically grounded risk minimisation framework that directly optimises the area under the receiver operating characteristic curve (area under the curve) of a dataless classification model. Settles (2011) and Li and Yang (2018) both used multinomial naïve Bayes (MNB) for dataless classification.…”

Section: Dataless Classification

mentioning

confidence: 99%

Learning from noisy out-of-domain corpus using dataless classification

Jin, Wanvarie, Le

2020

Nat. Lang. Eng.

Self Cite

In real-world applications, text classification models often suffer from a lack of accurately labelled documents. The available labelled documents may also be out of domain, making the trained model not able to perform well in the target domain. In this work, we mitigate the data problem of text classification using a two-stage approach. First, we mine representative keywords from a noisy out-of-domain data set using statistical methods. We then apply a dataless classification method to learn from the automatically selected keywords and unlabelled in-domain data. The proposed approach outperformed various supervised learning and dataless classification baselines by a large margin. We evaluated different keyword selection methods intrinsically and extrinsically by measuring their impact on the dataless classification accuracy. Last but not least, we conducted an in-depth analysis of the behaviour of the classifier and explained why the proposed dataless classification method outperformed supervised learning counterparts.
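The second stage described above learns from keywords plus unlabelled documents. One simple way to turn keywords into a noisy training signal, sketched here purely for illustration, is to split the unlabelled pool into documents that contain at least one keyword and documents that contain none. This is an assumption about how such a split could look, not the procedure used in either cited paper; the function name, the whitespace tokenisation, and the example documents are all hypothetical.

```python
def split_by_keywords(documents, keywords):
    """Split documents into (contains a keyword, contains none).

    Hypothetical helper: matching is by lowercase whitespace tokens,
    a deliberately crude choice for this sketch.
    """
    keyword_set = {k.lower() for k in keywords}
    likely_relevant, remainder = [], []
    for doc in documents:
        tokens = set(doc.lower().split())
        if tokens & keyword_set:
            likely_relevant.append(doc)
        else:
            remainder.append(doc)
    return likely_relevant, remainder

# Made-up documents and keywords, used only to exercise the helper.
docs = [
    "The team won the football match",
    "Parliament passed the budget bill",
    "A late goal decided the match",
]
relevant, rest = split_by_keywords(docs, ["football", "goal"])
print(len(relevant), "documents matched a keyword;", len(rest), "did not")
```

Both resulting sets are noisy: a relevant document may use none of the keywords, and an irrelevant one may mention a keyword in passing, which is why the citing works discuss learning methods that tolerate corrupted labels.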
