Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.70

An Online Semantic-enhanced Dirichlet Model for Short Text Stream Clustering

Abstract: Clustering short text streams is a challenging task due to their unique properties: infinite length, sparse data representation and cluster evolution. Existing approaches often exploit short text streams in a batch way. However, determining the optimal batch size is usually a difficult task since we have no prior knowledge of when the topics evolve. In addition, the traditional independent word representation in the graphical model tends to cause the "term ambiguity" problem in short text clustering. Therefore, in this paper…

Citations: cited by 28 publications (7 citation statements)
References: 33 publications (48 reference statements)
“…The CF vectors are updated for adding a new document to a cluster or removing a document from the cluster. These operations are called addition/deletion property (Kumar, Shao, et al 2020) and addible/deletable property (Qiang et al 2021). Here, a CF vector contains the list of biterms, the number of documents and the number of words in a cluster.…”
Section: Dirichlet Process-based Methods With Co-occurrence (mentioning)
confidence: 99%
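
The CF vector described in the statement above can be illustrated with a minimal sketch. It assumes that a biterm is an unordered pair of words co-occurring in one short text, and that adding or removing a document only adjusts the three stored quantities. The names `ClusterFeature`, `biterms`, `add_document` and `remove_document` are hypothetical and are not taken from the cited papers.

```python
from collections import Counter
from itertools import combinations


def biterms(tokens):
    """Return counts of unordered word pairs (biterms) in one short text."""
    # Sorting first gives each pair a canonical order, so (a, b) == (b, a).
    return Counter(combinations(sorted(tokens), 2))


class ClusterFeature:
    """Hypothetical CF vector: biterm counts, document count, word count."""

    def __init__(self):
        self.biterms = Counter()
        self.n_docs = 0
        self.n_words = 0

    def add_document(self, tokens):
        # Addition: fold the document's statistics into the cluster.
        self.biterms += biterms(tokens)
        self.n_docs += 1
        self.n_words += len(tokens)

    def remove_document(self, tokens):
        # Deletion: in-place Counter subtraction drops pairs that reach zero.
        self.biterms -= biterms(tokens)
        self.n_docs -= 1
        self.n_words -= len(tokens)


cf = ClusterFeature()
cf.add_document(["earthquake", "japan", "tsunami"])
cf.add_document(["japan", "tsunami"])
cf.remove_document(["japan", "tsunami"])  # cluster is back to one document
```

Because each operation touches only the counts of the affected document, a cluster can be updated without revisiting the documents it already contains, which is what makes the property useful for streams.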
“…In addition to short text datasets obtained from sources such as Google News or Twitter, there are several special datasets used in proposed methods such as CrisisLex. This dataset contains 26 crisis or disaster events occurred 2009–2014 around the world and it is used by Kumar, Shao, et al (2020). The study by Najafi et al (2022) uses datasets containing tweets about specific events.…”
Section: Datasets (mentioning)
confidence: 99%
“…The overwhelming amount of data produced on social media platforms justifies stream clustering having its own discipline in the machine learning research community, a discipline focused on the development of new ideas and algorithms in an unsupervised setting. In recent years, several new algorithms have been proposed by the community, each with its own advantages and disadvantages (Carnein et al, 2017; Kumar et al, 2020; Yin et al, 2018). A fundamental problem that not only hinders the acceptance but also the progress of such new ideas is the limited capacity for replicated evaluation, which is due to a lack of both accessible data sets and algorithm implementations.…”
Section: Stream Clustering (mentioning)
confidence: 99%
“…Traditional clustering algorithms such as K-means algorithm [3] have achieved great results, but when acting on text data with sparse features, these methods are easy to converge in advance [4], resulting in unsatisfactory clustering results. Topic models can usually represent document as multinomial distribution of topics, so some researches directly apply topic models to text clustering tasks and achieve great results [5]. However, for case-involved news, there are many similar but different cases due to the proximity of topics and types.…”
Section: Introduction (mentioning)
confidence: 99%