“…Unsupervised methods typically design features based on the assumption that segments within the same topic are more coherent than those belonging to different topics, drawing on lexical cohesion (Hearst, 1997; Choi, 2000; Riedl and Biemann, 2012b), topic models (Misra et al., 2009; Riedl and Biemann, 2012a; Jameel and Lam, 2013; Du et al., 2013), and semantic embeddings (Glavaš et al., 2016; Solbiati et al., 2021; Xing and Carenini, 2021). In contrast, supervised models can achieve more precise predictions by automatically mining cues of topic shift from large amounts of labeled data, either through classification over pairs of sentences or chunks (Wang et al., 2017; Lukasik et al., 2020) or through sequence labeling over the whole input sequence (Koshorek et al., 2018; Badjatiya et al., 2018; Xing et al., 2020). However, the memory consumption and efficiency of neural models such as BERT (Kenton and Toutanova, 2019) can be limiting factors for modeling long documents as their length increases.…”