Published: 2020
DOI: 10.1017/s1351324920000340

Learning from noisy out-of-domain corpus using dataless classification

Abstract: In real-world applications, text classification models often suffer from a lack of accurately labelled documents. The available labelled documents may also be out of domain, leaving the trained model unable to perform well in the target domain. In this work, we mitigate the data problem of text classification using a two-stage approach. First, we mine representative keywords from a noisy out-of-domain data set using statistical methods. We then apply a dataless classification method to learn from the automati…

Cited by 12 publications (21 citation statements). References: 32 publications.
“…use. Jin et al. (2020) demonstrated that the choice of seed keywords has a significant impact on the model's accuracy. STM (S_label) is the result of STM using only unigrams in the category name as seed keywords.…”
Section: Results of Coarse-grained Contextual Classification
Confidence: 99%
“…To investigate the contribution of the in-domain unlabeled documents to STM's superior performance, we trained an STM model with the manually curated keywords in Jin et al. (2020) and the Wikipedia dataset we used to train wiki2cat (denoted as STM (D_wiki)). There is a noticeable decrease in performance for STM (D_wiki) without in-domain unlabeled documents.…”
Section: Results of Coarse-grained Contextual Classification
Confidence: 99%
“…We use either the category name or trivial keywords (e.g., "good" and "bad" for sentiment classification tasks) as the only input seed word and use a keyword expansion algorithm to mine more candidate keywords. We apply pmi-freq (Equation 1) following Jin et al. (2020). It is the product of the logarithm of the candidate keyword w's document frequency and the point-wise mutual information between w and the seed word s. The higher the pmi-freq score, the more strongly the candidate keyword is associated with the seed word s. Additionally, we filter the mined keywords based on their part-of-speech tag depending on the classification task.…”
Section: Expanding Candidate Abstract From a Single Seed
Confidence: 99%
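The pmi-freq score described in the quoted passage can be written out concretely. Below is a minimal sketch, assuming document-level co-occurrence counts and a natural logarithm; the exact estimator, smoothing, and log base behind Equation 1 in Jin et al. (2020) may differ, and the toy corpus and candidate set are purely illustrative.

```python
import math

def pmi_freq(candidate: str, seed: str, docs: list[set[str]]) -> float:
    """Score a candidate keyword against a seed word as described in the
    quoted passage: log(document frequency of the candidate) times the
    point-wise mutual information between candidate and seed, estimated
    from document-level co-occurrence. A hedged reconstruction, not the
    authors' exact Equation 1."""
    n = len(docs)
    df_w = sum(1 for d in docs if candidate in d)                 # docs with candidate
    df_s = sum(1 for d in docs if seed in d)                      # docs with seed
    df_ws = sum(1 for d in docs if candidate in d and seed in d)  # co-occurrence
    if df_w == 0 or df_s == 0 or df_ws == 0:
        return float("-inf")  # PMI undefined; rank such candidates last
    # PMI(w, s) = log( P(w, s) / (P(w) * P(s)) ), probabilities at document level
    pmi = math.log((df_ws / n) / ((df_w / n) * (df_s / n)))
    return math.log(df_w) * pmi

# Toy usage: expand from the single seed "good" for a sentiment task.
docs = [
    {"good", "great", "movie"},
    {"bad", "awful"},
    {"good", "excellent"},
    {"great", "good"},
    {"bad", "boring", "movie"},
]
candidates = {"great", "excellent", "awful", "movie"}
ranked = sorted(candidates, key=lambda w: pmi_freq(w, "good", docs), reverse=True)
print(ranked)  # candidates most strongly associated with the seed come first
```

Multiplying PMI by log document frequency rewards keywords that are both strongly associated with the seed and common enough to be reliable, which is why frequent but seed-neutral words (like "movie" above) rank low.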
“…Weakly-supervised text classification eliminates the need for any labeled documents and induces classifiers with only a handful of carefully chosen seed words. However, some researchers have pointed out that the choice of seed words has a significant impact on the performance of weakly-supervised models (Li et al., 2018; Jin et al., 2020). The vast majority of previous work assumed high-quality seed words are given.…”
Section: Introduction
Confidence: 99%