2015
DOI: 10.1103/physrevx.5.011007

High-Reproducibility and High-Accuracy Method for Automated Topic Classification

Abstract: Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent searching, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic modeling. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA oft…

citations
Cited by 76 publications
(91 citation statements)
references
References 33 publications
(39 reference statements)
“…To overcome this limitation of the journal-level analysis, we must determine the research topic of each publication at a finer scale. To this end, we use a highly accurate and reproducible topic classification algorithm to identify the topics of publications [35]. We identify a total of 69 topics using the titles and abstracts from the set of 61,116 publications by molecular biology faculty in our database.…”
Section: Results (mentioning)
confidence: 99%
“…Topic represents the topic number identified by the topic classification algorithm and is field-specific [35]; Outlier topic represents the topic in Fig 5.…”
Section: Results (mentioning)
confidence: 99%
“…In another work, Tumminello et al (2011) used the hypergeometric distribution and measured the P-value for each subset of the bipartite network. Moreover, Lancichinetti et al (2015) proposed a community detection method to classify topics to articles more efficiently, and Serrano et al (2009) used a disparity filtering method to infer significant weights in networks. Finally, Ronen et al (2014) adopted a statistical approach to determine significant links between languages in various written documents.…”
Section: Methods (mentioning)
confidence: 99%
“…Due to the high degeneracy of the likelihood landscape, standard optimization algorithms will more likely infer different models after different optimization runs than infer the model with the highest likelihood, as has been previously reported Blei et al (2010a); Wallach et al (2009). A research on the validity of LDA optimization algorithms for inferring topic models also proposed that current implementations of LDA had low validity (Lancichinetti et al, 2015).…”
Section: Introduction (mentioning)
confidence: 99%