2020
DOI: 10.1109/access.2020.2973207
Topic Modeling for Short Texts via Word Embedding and Document Correlation

Abstract: Topic modeling is a widely studied foundational and interesting problem in the text mining domain. Conventional topic models based on word co-occurrences infer the hidden semantic structure from a corpus of documents. However, due to the limited length of short text, data sparsity impedes the inference process of conventional topic models and causes unsatisfactory results on short texts. In fact, each short text usually contains a limited number of topics, and understanding semantic content of short text need…

Cited by 35 publications (25 citation statements) | References 40 publications
“…In this work, we leverage the Topically Driven Neural Language Model (TDLM) (Lau et al., 2017) to obtain topic representations, as it can employ pre-trained embeddings, which are found to be more suitable for short Twitter comments (Yi et al., 2020). The original model of TDLM applies a Convolutional Neural Network (CNN) over word embeddings to generate a comment embedding.…”
Section: Combining Topic Model and HateBERT
confidence: 99%
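The statement above describes TDLM's encoder only at a high level: a CNN applied over pre-trained word embeddings that yields a single comment embedding. The sketch below illustrates that general idea; it is not the TDLM implementation, and the class name, dimensions, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the TDLM code): a 1-D CNN over word embeddings,
# max-pooled into one fixed-size comment embedding.
import torch
import torch.nn as nn

class CommentEncoder(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=300, num_filters=100, kernel_size=3):
        super().__init__()
        # In practice the embedding weights would be initialised from
        # pre-trained vectors (e.g. word2vec/GloVe) and optionally frozen.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size, padding=1)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                  # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))           # (batch, num_filters, seq_len)
        return x.max(dim=2).values             # (batch, num_filters) comment embedding

# Toy usage: two short comments padded to length 8.
encoder = CommentEncoder()
batch = torch.randint(0, 30000, (2, 8))
print(encoder(batch).shape)                    # torch.Size([2, 100])
```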
“…A global and local topic model, GLTM [16], integrates word embeddings trained from both a short-text corpus and an auxiliary corpus. TRNMF [17] uses word embeddings to generate a sentence-similarity regularization and integrates it with word co-occurrence information. CME-DMM [31] is a collaborative modeling and embedding framework that incorporates topic and word embeddings.…”
Section: Related Work
confidence: 99%
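The TRNMF description above only names a sentence-similarity regularization built from word embeddings. The following sketch shows one common way such a similarity matrix could be computed (averaged word vectors plus cosine similarity); it is an illustration under that assumption, not the TRNMF algorithm, and the vocabulary and documents are made up.

```python
# Illustrative only: a sentence-similarity matrix from averaged word embeddings,
# the kind of quantity a similarity-based regulariser could consume.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"topic": 0, "model": 1, "short": 2, "text": 3, "embedding": 4}
word_vecs = rng.normal(size=(len(vocab), 50))   # stand-in for pre-trained vectors

def sentence_vec(tokens):
    """Average the embeddings of known tokens; zero vector if none are known."""
    vecs = [word_vecs[vocab[t]] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(word_vecs.shape[1])

docs = [["short", "text", "topic", "model"],
        ["word", "embedding", "topic"],
        ["unrelated", "tokens"]]
S = np.stack([sentence_vec(d) for d in docs])
norms = np.linalg.norm(S, axis=1, keepdims=True)
norms[norms == 0] = 1.0                          # avoid division by zero
sim = (S / norms) @ (S / norms).T                # cosine similarity matrix
print(np.round(sim, 2))
```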
“…But meta-data as auxiliary information is not always available. Recent works prefer to incorporate word embedding information [14]-[17]. But word embeddings trained from an inappropriate auxiliary corpus will lead to poor performance [18].…”
Section: Introduction
confidence: 99%
“…Jiang et al. [18] proposed a novel text classification algorithm based on Ant Colony Optimization (ACO). It exploited the discreteness of the features of text documents and the value ACO provides in addressing discrete problems.…”
Section: Related Work
confidence: 99%