Proceedings of the 18th ACM Conference on Information and Knowledge Management 2009
DOI: 10.1145/1645953.1646170

Text segmentation via topic modeling

Abstract: In this paper, the task of text segmentation is approached from a topic modeling perspective. We investigate the use of the latent Dirichlet allocation (LDA) topic model to segment a text into semantically coherent segments. A major benefit of the proposed approach is that, along with the segment boundaries, it outputs the topic distribution associated with each segment. This information is of potential use in applications like segment retrieval and discourse analysis. The new approach outperforms a standard baseli…
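As an illustration of the general idea described in the abstract (not the authors' exact system), the sketch below fits an LDA model on a background corpus with gensim and infers the topic distribution of a candidate segment. The tokenized inputs, the 20-topic setting, and the function names are assumptions made purely for this example.

# Minimal sketch of LDA-based segment modeling, assuming gensim is installed.
# Training data and hyperparameters are placeholders, not the paper's settings.
from gensim import corpora, models

def train_lda(tokenized_docs, num_topics=20):
    """Fit an LDA topic model on a background corpus of tokenized documents."""
    dictionary = corpora.Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=num_topics)
    return lda, dictionary

def segment_topics(lda, dictionary, segment_tokens):
    """Infer the topic distribution of one candidate segment (a list of tokens)."""
    bow = dictionary.doc2bow(segment_tokens)
    # minimum_probability=0.0 returns the full distribution over all topics
    return lda.get_document_topics(bow, minimum_probability=0.0)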

Cited by 72 publications (65 citation statements)
References 13 publications (28 reference statements)
“…For the evaluation on the Choi dataset, the GRAPHSEG algorithm made use of the publicly available word embeddings built from a Google News dataset. Both LDA-based models (Misra et al., 2009; Riedl and Biemann, 2012) and GRAPHSEG rely on corpus-derived word representations. Thus, we evaluated on the Manifesto dataset both the domain-adapted and domain-unadapted variants of these methods.…”
Section: Experimental Setting
confidence: 99%
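For context on the corpus-derived representations mentioned in the statement above, the snippet below shows one way to load the publicly available Google News word2vec vectors with gensim. The file name is the conventional distribution name and the file must be downloaded separately, so treat the path as a placeholder.

# Sketch: load pretrained Google News word2vec embeddings (placeholder path).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
# Cosine similarity between two word vectors, as used by similarity-based segmenters.
print(vectors.similarity("segment", "section"))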
“…(Sun, Li, Luo & Wu, 2008; Zhang, Kang, Qian & Huang, 2014; Rangel, Faria, Lima & Oliveira, 2016) apply LDA to a corpus of segments, compute inter-segment similarities via a Fisher kernel, and optimize the segmentation via dynamic programming. (Misra, Yvon, Jose, & Cappe, 2009; Glavaš, Nanni & Ponzetto, 2016) use a document-level LDA model, treat candidate sections as new documents whose topic distributions are inferred, and likewise derive the segmentation via dynamic programming with probabilistic scores. Identifying the useful information in large documents remains a challenge in itself (Aggarwal & Zhai, 2012; Zhai & Massung, 2016).…”
Section: Background and Related Work
confidence: 99%
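A hedged sketch of the dynamic-programming step referred to in the statement above: given any per-segment score (for example, an LDA log-likelihood), a set of boundaries maximizing the summed scores can be recovered as follows. The score callable is a hypothetical stand-in, not the scoring used in the cited papers.

def segment_dp(n_sentences, score):
    """Choose boundaries maximizing the total score of contiguous segments.

    score(i, j) is assumed to return the score of the segment covering
    sentences i .. j-1 (0-based, half-open interval).
    """
    NEG_INF = float("-inf")
    best = [NEG_INF] * (n_sentences + 1)  # best[j]: best score for sentences [0, j)
    back = [0] * (n_sentences + 1)        # back[j]: start index of the last segment
    best[0] = 0.0
    for j in range(1, n_sentences + 1):
        for i in range(j):
            candidate = best[i] + score(i, j)
            if candidate > best[j]:
                best[j] = candidate
                back[j] = i
    # Walk the back-pointers from the end to recover the segment end positions.
    boundaries, j = [], n_sentences
    while j > 0:
        boundaries.append(j)
        j = back[j]
    return sorted(boundaries)

# Example: segment_dp(5, lambda i, j: -(j - i - 2) ** 2) favors segments of length 2.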
“…Traditional document clustering operates on high-dimensional representations of texts (Misra et al., 2009; Glavaš, Nanni & Ponzetto, 2016). Logical structure cues within the document, together with semantic criteria and statistical similarity measures, are mainly used to identify thematically coherent, contiguous text blocks in unstructured documents (Sun et al., 2008; Zhang et al., 2014; Rangel et al., 2016).…”
Section: Background and Related Work
confidence: 99%
“…The proposed solutions differ widely in how they compute the sentence-pair similarity (i.e., topical cohesiveness). Measures based on word co-occurrence [2,9,10] and on generative models [1,18,20,23] have been studied extensively. The segment boundaries may be determined not only from the local sentence-pair similarities but also from global information derived from the distribution of lexical similarities among more distant sentences [2,10].…”
Section: Related Work
confidence: 99%
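As a concrete illustration of a local sentence-pair similarity (one of the measures surveyed in the statement above), the sketch below computes the cosine similarity between two sentences' topic vectors. The dense-vector conversion assumes a gensim-style list of (topic_id, probability) pairs and a fixed topic count; both are illustrative assumptions.

import math

def to_dense(topic_dist, num_topics):
    """Convert a sparse (topic_id, probability) list into a dense vector."""
    vec = [0.0] * num_topics
    for topic_id, prob in topic_dist:
        vec[topic_id] = prob
    return vec

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Adjacent sentences with low topical cosine similarity are candidate boundaries.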