Published: 2020
DOI: 10.1017/s1351324920000340

Learning from noisy out-of-domain corpus using dataless classification

Abstract: In real-world applications, text classification models often suffer from a lack of accurately labelled documents. The available labelled documents may also be out of domain, leaving the trained model unable to perform well in the target domain. In this work, we mitigate the data problem of text classification using a two-stage approach. First, we mine representative keywords from a noisy out-of-domain data set using statistical methods. We then apply a dataless classification method to learn from the automati…

Cited by 12 publications (21 citation statements). References: 32 publications.
“…use. Jin et al. (2020) demonstrated that the choice of seed keywords has a significant impact on the model's accuracy. STM (S_label) is the result of STM using only unigrams in the category name as seed keywords.…”
Section: Results of Coarse-grained Contextual Classification
Confidence: 99%
“…To investigate the contribution of the in-domain unlabeled documents to STM's superior performance, we trained an STM model with the manually curated keywords in Jin et al. (2020) and the Wikipedia dataset we used to train wiki2cat (denoted as STM (D_wiki)). There is a noticeable decrease in performance for STM (D_wiki) without in-domain unlabeled documents.…”
Section: Results of Coarse-grained Contextual Classification
Confidence: 99%
“…We use either the category name or trivial keywords (e.g., "good" and "bad" for sentiment classification tasks) as the only input seed word and use a keyword expansion algorithm to mine more candidate keywords. We apply pmi-freq (Equation 1) following Jin et al. (2020). It is the product of the logarithm of the candidate keyword w's document frequency and the point-wise mutual information between w and the seed word s. The higher the pmi-freq score, the more strongly the candidate keyword is associated with the seed word s. Additionally, we filter the mined keywords based on their part-of-speech tag depending on the classification task.…”
Section: Expanding Candidate Abstract From a Single Seed
Confidence: 99%
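The pmi-freq score described in the quoted passage can be written out concretely. Below is a minimal sketch, assuming document-level co-occurrence counts and a natural logarithm; the exact estimator, smoothing, and log base behind Equation 1 in Jin et al. (2020) may differ, and the toy corpus and candidate set are purely illustrative.

```python
import math

def pmi_freq(candidate: str, seed: str, docs: list[set[str]]) -> float:
    """Score a candidate keyword against a seed word as described in the
    quoted passage: log(document frequency of the candidate) times the
    point-wise mutual information between candidate and seed, estimated
    from document-level co-occurrence. A hedged reconstruction, not the
    authors' exact Equation 1."""
    n = len(docs)
    df_w = sum(1 for d in docs if candidate in d)                 # docs with candidate
    df_s = sum(1 for d in docs if seed in d)                      # docs with seed
    df_ws = sum(1 for d in docs if candidate in d and seed in d)  # co-occurrence
    if df_w == 0 or df_s == 0 or df_ws == 0:
        return float("-inf")  # PMI undefined; rank such candidates last
    # PMI(w, s) = log( P(w, s) / (P(w) * P(s)) ), probabilities at document level
    pmi = math.log((df_ws / n) / ((df_w / n) * (df_s / n)))
    return math.log(df_w) * pmi

# Toy usage: expand from the single seed "good" for a sentiment task.
docs = [
    {"good", "great", "movie"},
    {"bad", "awful"},
    {"good", "excellent"},
    {"great", "good"},
    {"bad", "boring", "movie"},
]
candidates = {"great", "excellent", "awful", "movie"}
ranked = sorted(candidates, key=lambda w: pmi_freq(w, "good", docs), reverse=True)
print(ranked)  # candidates most strongly associated with the seed come first
```

Multiplying PMI by log document frequency rewards keywords that are both strongly associated with the seed and common enough to be reliable, which is why frequent but seed-neutral words (like "movie" above) rank low.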
“…Weakly-supervised text classification eliminates the need for any labeled documents and induces classifiers with only a handful of carefully chosen seed words. However, some researchers have pointed out that the choice of seed words has a significant impact on the performance of weakly-supervised models (Li et al., 2018; Jin et al., 2020). The vast majority of previous work assumed high-quality seed words are given.…”
Section: Introduction
Confidence: 99%