Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.135
Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!

Abstract: Topic models are a useful analysis tool to uncover the underlying themes within document collections. The dominant approach is to use probabilistic topic models that posit a generative story, but in this paper we propose an alternative way to obtain topics: clustering pretrained word embeddings while incorporating document information for weighted clustering and reranking top words. We provide benchmarks for the combination of different word embeddings and clustering algorithms, and analyse their performance u…
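To make the approach described in the abstract concrete, below is a minimal illustrative sketch of clustering pretrained word embeddings into topics. It is not the authors' implementation: the function name, the frequency-based weighting, and the centroid-distance reranking are assumptions standing in for the "document information for weighted clustering and reranking" the abstract mentions, and KMeans stands in for any of the clustering algorithms the paper benchmarks.

# Minimal sketch: topics as clusters of pretrained word embeddings.
# Assumptions (not from the paper): corpus frequencies serve as cluster weights,
# and top words are reranked by frequency divided by distance to the centroid.
import numpy as np
from sklearn.cluster import KMeans

def cluster_topics(vocab, embeddings, doc_frequencies, num_topics=20, top_k=10):
    # vocab           : list of words
    # embeddings      : (len(vocab), dim) array of pretrained word vectors
    # doc_frequencies : (len(vocab),) array of corpus frequencies (document information)
    weights = doc_frequencies / doc_frequencies.sum()

    # Weighted clustering: frequent words contribute more to the centroids.
    km = KMeans(n_clusters=num_topics, n_init=10, random_state=0)
    labels = km.fit_predict(embeddings, sample_weight=weights)

    topics = []
    for t in range(num_topics):
        idx = np.where(labels == t)[0]
        # Rerank words within the cluster: prefer frequent words close to the centroid.
        dists = np.linalg.norm(embeddings[idx] - km.cluster_centers_[t], axis=1)
        scores = weights[idx] / (dists + 1e-9)
        top = idx[np.argsort(-scores)][:top_k]
        topics.append([vocab[i] for i in top])
    return topics

Any pretrained embedding table (e.g. word2vec, GloVe, or contextual embeddings pooled per word type) can be passed as the embeddings matrix; the resulting lists of top words play the role that per-topic word distributions play in probabilistic topic models.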

Citations: cited by 69 publications (63 citation statements)
References: 19 publications (9 reference statements)
“…The topic of this paper is related to the more fundamental question of how PLMs represent the meaning of complex words in the first place. So far, most studies have focused on methods of representation extraction, using ad-hoc heuristics such as averaging the subword embeddings (Pinter et al., 2020; Sia et al., 2020) or taking the first subword embedding (Devlin et al., 2019; Heinzerling and Strube, 2019; Martin et al., 2020). While not resolving the issue, we lay the theoretical groundwork for more systematic analyses by showing that PLMs can be regarded as serial dual-route models (Caramazza et al., 1988), i.e., the meanings of complex words are either stored or else need to be computed from the subwords.…”
Section: Introduction (mentioning)
confidence: 99%
“…News event tracking has also been framed as a non-parametric topic modeling problem (Zhou et al., 2015), and HDPs that share parameters across temporal batches have been used for this task (Beykikhoshk et al., 2018). Dense document representations have been shown to be useful in the parametric variant of our problem, with neural LDA (Dieng et al., 2019a; Keya et al., 2019; Dieng et al., 2019b; Bianchi et al., 2020), temporal topic evolution models (Zaheer et al., 2017; Gupta et al., 2018; Zaheer et al., 2019; Brochier et al., 2020) and embedding space clustering (Momeni et al., 2018; Sia et al., 2020) being some prominent approaches in the literature.…”
Section: Related Work (mentioning)
confidence: 99%
“…However, our problem setting is completely different: we extract topics from documents in an unsupervised way, where document links/metadata/labels either don't exist or are not used to extract the topics. Some very recent works use pre-trained BERT (Devlin et al., 2019) either to leverage improved text representations (Bianchi et al., 2020; Sia et al., 2020) or to augment topic models through knowledge distillation (Hoyle et al., 2020a). Zhu et al. (2020) and Dieng et al. (2020) jointly train words and topics in a shared embedding space.…”
Section: Introduction (mentioning)
confidence: 99%