An Online Semantic-enhanced Dirichlet Model for Short Text Stream Clustering

Kumar, Jay; Shao, Jun; Uddin, Salah; Ali, Wazir

doi:10.18653/v1/2020.acl-main.70

Cited by 28 publications

(7 citation statements)

References 33 publications

(48 reference statements)

“…The CF vectors are updated for adding a new document to a cluster or removing a document from the cluster. These operations are called addition/deletion property (Kumar, Shao, et al 2020) and addible/deletable property (Qiang et al 2021). Here, a CF vector contains the list of biterms, the number of documents and the number of words in a cluster.…”

Section: Dirichlet Process-based Methods With Co-occurrence (mentioning)

confidence: 99%
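
The addition/deletion property described in the citation statement above can be illustrated with a minimal sketch. It assumes a CF vector holding exactly the three items the statement lists (biterms, document count, word count) and treats a biterm as an unordered pair of words from the same short text. The names `ClusterFeature`, `biterms`, `add`, and `remove` are hypothetical and are not taken from the cited papers.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import combinations


def biterms(tokens):
    """Return the unordered word pairs (biterms) co-occurring in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


@dataclass
class ClusterFeature:
    """Hypothetical CF vector: biterm counts, document count, word count."""
    biterm_counts: Counter = field(default_factory=Counter)
    n_docs: int = 0
    n_words: int = 0

    def add(self, tokens):
        """Addition property: fold one document into the cluster summary."""
        self.biterm_counts.update(biterms(tokens))
        self.n_docs += 1
        self.n_words += len(tokens)

    def remove(self, tokens):
        """Deletion property: undo a previous add() of the same document."""
        self.biterm_counts.subtract(biterms(tokens))
        # Unary plus drops entries whose count fell to zero or below.
        self.biterm_counts = +self.biterm_counts
        self.n_docs -= 1
        self.n_words -= len(tokens)


cf = ClusterFeature()
cf.add(["earthquake", "nepal", "relief"])
cf.add(["nepal", "relief"])
cf.remove(["nepal", "relief"])
print(cf.n_docs, cf.n_words, dict(cf.biterm_counts))
```

Because both operations touch only the counts of the document being added or removed, the cluster summary can be maintained incrementally as the stream arrives, without revisiting earlier documents.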

“…In addition to short text datasets obtained from sources such as Google News or Twitter, there are several special datasets used in proposed methods such as CrisisLex. This dataset contains 26 crisis or disaster events occurred 2009–2014 around the world and it is used by Kumar, Shao, et al (2020). The study by Najafi et al (2022) uses datasets containing tweets about specific events.…”

Section: Datasets (mentioning)

confidence: 99%

“…In addition to these commonly used metrics, other quality measures such as V-Measure (Kumar, Shao, et al, 2020), Adjusted Rand Index (J. Zhang, Liu, et al, 2021), f1-score (Ozdikis et al, 2017), precision, and recall (Najafi et al, 2022) are also used by the researchers to compare the performance of proposed methods with the state-of-the-art methods.…”

Section: Clustering Quality Measures (mentioning)

confidence: 99%
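
Of the quality measures named in the citation statement above, V-Measure is the one attributed to Kumar, Shao, et al (2020). As a rough illustration, the sketch below computes it from its standard definition, the harmonic mean of homogeneity and completeness, using natural-log entropies. The function names are hypothetical, the inputs are assumed to be non-empty label lists of equal length, and this is not the evaluation code of any cited paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(labels) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(targets, given):
    """Conditional entropy H(targets | given) for two aligned label lists."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(true_labels, cluster_labels):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_true = entropy(true_labels)
    h_clus = entropy(cluster_labels)
    # Homogeneity: each cluster contains members of a single class.
    homogeneity = (
        1.0 if h_true == 0
        else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_true
    )
    # Completeness: all members of a class land in the same cluster.
    completeness = (
        1.0 if h_clus == 0
        else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_clus
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Cluster ids are arbitrary, so a relabeled perfect clustering still scores 1.
print(v_measure([0, 0, 1, 1], [1, 1, 0, 0]))
```

The score depends only on how documents are grouped, not on the cluster ids themselves, which suits stream clustering where the number and naming of clusters are not fixed in advance.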

Recent methods on short text stream clustering: A survey study

Maden; Karagöz. 2023. WIREs Computational Stats

The volume and the velocity of data in social media are increasing and the social media has become a very useful environment to detect and track the real‐world events. However, to fulfill this, it is crucial to group‐related texts according to their topics and clustering takes an essential role at this point since we have no prior knowledge about the topics and their evolution in social media. In this survey, we review the current approaches and techniques proposed for short text stream clustering in recent years. The reviewed techniques are grouped according to their methodology and discussed in detail. Also, the datasets utilized to evaluate the performance of the proposed methods and the results are summarized together with the clustering quality measures used for these evaluations. Furthermore, current challenges about short‐text stream clustering are discussed. This article is categorized under: Data: Types and Structure > Streaming Data

“…The overwhelming amount of data produced on social media platforms justifies stream clustering having its own discipline in the machine learning research community, a discipline focused on the development of new ideas and algorithms in an unsupervised setting. In recent years, several new algorithms have been proposed by the community, each with its own advantages and disadvantages (Carnein et al, 2017; Kumar et al, 2020; Yin et al, 2018). A fundamental problem that not only hinders the acceptance but also the progress of such new ideas is the limited capacity for replicated evaluation, which is due to a lack of both accessible data sets and algorithm implementations.…”

Section: Stream Clustering (mentioning)

confidence: 99%

Benchmarking Crisis in Social Media Analytics: A Solution for the Data-Sharing Problem

Assenmacher; Weber; Preuß; et al. 2021. Social Science Computer Review

Computational social science uses computational and statistical methods in order to evaluate social interaction. The public availability of data sets is thus a necessary precondition for reliable and replicable research. These data allow researchers to benchmark the computational methods they develop, test the generalizability of their findings, and build confidence in their results. When social media data are concerned, data sharing is often restricted for legal or privacy reasons, which makes the comparison of methods and the replicability of research results infeasible. Social media analytics research, consequently, faces an integrity crisis. How is it possible to create trust in computational or statistical analyses, when they cannot be validated by third parties? In this work, we explore this well-known, yet little discussed, problem for social media analytics. We investigate how this problem can be solved by looking at related computational research areas. Moreover, we propose and implement a prototype to address the problem in the form of a new evaluation framework that enables the comparison of algorithms without the need to exchange data directly, while maintaining flexibility for the algorithm design.

“…Traditional clustering algorithms such as K-means algorithm [3] have achieved great results, but when acting on text data with sparse features, these methods are easy to converge in advance [4], resulting in unsatisfactory clustering results. Topic models can usually represent document as multinomial distribution of topics, so some researches directly apply topic models to text clustering tasks and achieve great results [5]. However, for case-involved news, there are many similar but different cases due to the proximity of topics and types.…”

Section: Introduction (mentioning)

confidence: 99%

A Clustering Method of Case-Involved News by Combining Topic Network and Multi-Head Attention Mechanism

Mao; Liang; et al. 2021. Sensors

Finding the news of same case from the large numbers of case-involved news is an important basis for public opinion analysis. Existing text clustering methods are usually based on topic models which only use topic and case information as the global features of documents, so distinguishing between different cases with similar types remains a challenge. The contents of documents contain rich local features. Taking into account the internal features of news, the information of cases and the contributions provided by different topics, we propose a clustering method of case-involved news, which combines topic network and multi-head attention mechanism. Using case information and topic information to construct a topic network, then extracting the global features by graph convolution network, thus realizing the combination of case information and topic information. At the same time, the local features are extracted by multi-head attention mechanism. Finally, the fusion of global features and local features is realized by variational auto-encoder, and the learned latent representations are used for clustering. The experiments show that the proposed method significantly outperforms the state-of-the-art unsupervised clustering methods.
