Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence 2020
DOI: 10.24963/ijcai.2020/549
Dataless Short Text Classification Based on Biterm Topic Model and Word Embeddings

Abstract: Dataless text classification has attracted increasing attention recently. It needs only a few seed words per category to classify documents, which is much cheaper than supervised text classification requiring massive labeling effort. However, most existing models focus on long texts and achieve unsatisfactory performance on short texts, which have become increasingly popular on the Internet. In this paper, we first propose a novel model named Seeded Biterm Topic Model (SeedBTM) …
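For intuition about the seed-word idea in the abstract, the sketch below classifies a document by its average embedding similarity to each category's seed words. This is a simplification, not the paper's SeedBTM model (which couples this idea with a seeded biterm topic model); the `embeddings` lookup is an assumed pre-trained resource such as GloVe vectors.

```python
import numpy as np

# Illustrative dataless classification with seed words and pre-trained
# word embeddings. A simplification for intuition only; SeedBTM itself
# combines embedding similarity with a seeded biterm topic model.

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def classify(doc_words, seed_words, embeddings):
    """Assign the category whose seed words are most similar on average."""
    scores = {}
    for category, seeds in seed_words.items():
        sims = [cosine(embeddings[w], embeddings[s])
                for w in doc_words if w in embeddings
                for s in seeds if s in embeddings]
        scores[category] = float(np.mean(sims)) if sims else 0.0
    return max(scores, key=scores.get)

# Example: two categories, each described by a handful of seed words.
seed_words = {"sports": ["game", "team", "player"],
              "technology": ["software", "computer", "internet"]}
```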

Cited by 13 publications (5 citation statements)
References 1 publication
“…At an early stage, some researchers used auxiliary knowledge bases such as Wikipedia to establish the semantic correlation between texts and labels [3,29]. Subsequently, topic-model-based methods emerged [4,13,14,33,34], which inferred category-aware topics from a limited set of seed words. In recent years, neural methods have gained prominence [22,23,31,36,39].…”
Section: Related Work 2.1 Weakly Supervised Text Classification (mentioning)
confidence: 99%
“…First, we associate each topic $z$ with an individual attribute value $c_q$, and initialize the states of the Markov chain randomly, as in BTM. Next, inspired by (Yang et al. 2020), we define the conditional distribution $P(c_q \mid \mathbf{c}_{\neg b_{i,h,l}}, \mathfrak{B}, \alpha, \beta)$ for each biterm $b_{i,h,l}$ in the biterm set $\mathfrak{B}$ by combining the biterm-attribute-value similarity score $\Omega(b_{i,h,l}, c_q)$ with the conditional distribution $P(c_q \mid \mathbf{c}_{\neg b_{i,h,l}}, \mathfrak{B}, \alpha, \beta)$ (Formula 1) as follows:…”
Section: Attribute Knowledge Integration (AKI) Module (mentioning)
confidence: 99%
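For intuition, here is a minimal Python sketch of such a combined sampling step: the standard collapsed-BTM conditional is weighted by the similarity score $\Omega$. The counter names (`n_c`, `n_cw`) and the `omega` argument are illustrative assumptions, and the multiplicative combination is assumed since the quoted statement truncates before Formula 1.

```python
import numpy as np

def sample_category(biterm, n_c, n_cw, alpha, beta, V, omega):
    """Draw a category assignment for one biterm (w1, w2).

    n_c[q]     -- number of biterms currently assigned to category q
    n_cw[q][w] -- count of word w currently assigned to category q
    omega[q]   -- similarity score Omega(biterm, c_q), e.g. from cosine
                  similarity of word and seed-word embeddings
    """
    w1, w2 = biterm
    K = len(n_c)
    p = np.zeros(K)
    for q in range(K):
        # Standard collapsed-BTM conditional, up to normalization ...
        btm_part = ((n_c[q] + alpha)
                    * (n_cw[q][w1] + beta) * (n_cw[q][w2] + beta)
                    / (2 * n_c[q] + V * beta) ** 2)
        # ... combined multiplicatively with the similarity score
        # (the exact combination rule is an assumption here).
        p[q] = btm_part * omega[q]
    p /= p.sum()
    return np.random.choice(K, p=p)
```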
“…Specifically, ConWea (Mekala and Shang 2020) can utilize user-provided seed words to create a contextualized utterance corpus, which is further leveraged to train an utterance classifier and expand seed words iteratively. SeedBTM (Yang et al. 2020) can utilize user-provided seed words to extend BTM into an utterance classifier based on the word embedding technique. LOTClass (Meng et al. 2020) generates attribute-indicative words for each attribute value to fine-tune a PLM on a word-level category prediction task, and then performs self-training on unlabeled utterances.…”
Section: Effectiveness Study (mentioning)
confidence: 99%
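The self-training step mentioned for LOTClass follows a generic pseudo-labeling pattern. The sketch below shows that pattern with a TF-IDF and logistic-regression stand-in, an assumption made to keep the example self-contained; the actual method fine-tunes a pre-trained language model instead.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Generic self-training / pseudo-labeling loop. The TF-IDF + logistic
# regression classifier is a stand-in for the fine-tuned PLM used by
# LOTClass, chosen only for self-containedness.

def self_train(labeled_texts, labels, unlabeled_texts, rounds=3, thresh=0.9):
    vec = TfidfVectorizer()
    X_all = vec.fit_transform(labeled_texts + unlabeled_texts)
    X, y = X_all[: len(labeled_texts)], list(labels)
    X_u = X_all[len(labeled_texts):]
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X, y)
        if X_u.shape[0] == 0:
            break
        proba = clf.predict_proba(X_u)
        confident = np.flatnonzero(proba.max(axis=1) >= thresh)
        if confident.size == 0:
            break
        # Move high-confidence pseudo-labeled examples into the training set.
        X = vstack([X, X_u[confident]])
        y += list(clf.classes_[proba[confident].argmax(axis=1)])
        X_u = X_u[np.setdiff1d(np.arange(X_u.shape[0]), confident)]
    return clf, vec
```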