2015
DOI: 10.1162/tacl_a_00140

Improving Topic Models with Latent Feature Word Representations

Abstract: Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora…
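The model described in the abstract mixes a Dirichlet-multinomial topic-word distribution with a latent-feature component defined over pre-trained word vectors. Below is a minimal sketch of that two-component word distribution, assuming a fixed mixture weight `lam`; the variable names (`phi`, `topic_vecs`, `word_vecs`) are illustrative, not from the paper's code:

```python
import numpy as np

def word_distribution(topic_id, phi, topic_vecs, word_vecs, lam=0.6):
    """Per-topic word distribution mixing a Dirichlet-multinomial
    component with a latent-feature (embedding) component.

    phi        : (T, V) Dirichlet-multinomial topic-word probabilities
    topic_vecs : (T, D) learned topic embeddings
    word_vecs  : (V, D) pre-trained word embeddings, kept fixed
    lam        : mixture weight of the latent-feature component
    """
    # Latent-feature component: softmax over the vocabulary of dot
    # products between the topic vector and each word vector.
    logits = word_vecs @ topic_vecs[topic_id]
    logits -= logits.max()                      # numerical stability
    lf = np.exp(logits)
    lf /= lf.sum()

    # With probability lam the word comes from the embedding
    # component, otherwise from the Dirichlet-multinomial component.
    return lam * lf + (1.0 - lam) * phi[topic_id]

# Example: sample one word for topic 3 from a toy model.
rng = np.random.default_rng(0)
V, T, D = 1000, 10, 50
phi = rng.dirichlet(np.ones(V), size=T)
topic_vecs = rng.normal(size=(T, D))
word_vecs = rng.normal(size=(V, D))
p = word_distribution(3, phi, topic_vecs, word_vecs)
w = rng.choice(V, p=p)
```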

Cited by 267 publications (173 citation statements)
References 29 publications (40 reference statements)
“…Similarly, "Brisbane" (in Australia) is usually abbreviated as "BNE" and called "Brissie". As a result, current text mining approaches (e.g., topic modeling [13] [14] and other heuristics [15] [16]) may not gain sufficient statistical signal and may mismatch the textual content of similar authors. Consequently, the correlation edge weight between the pair of authors will be calculated incorrectly.…”
Section: Challenge 1 (Mismatched Author Contents)
Citation type: mentioning
confidence: 99%
“…Vector representations come in other forms: Paragraph2Vec [37], ConceptVector [30], Category2Vec [38], Prod2Vec [31]. Moreover, [13] extends topic models so that each word is generated from either the Dirichlet multinomial component or the embedding module. [38] enriches the embedding with Knowledge Graphs to eliminate ambiguity and improve similarity measures.…”
Section: Word Embedding
Citation type: mentioning
confidence: 99%
“…Chen et al developed MDK-LDA, a variant of LDA that incorporates domain knowledge directly to provide better topic descriptors [7]. Furthermore, approaches that combine word embeddings with topic modeling can be beneficial for learning both models jointly [42], as well as for improving topic model representations of short texts through word embeddings [36,43,58], or for creating improved word embeddings using LDA [46].…”
Section: Semantic Interaction
Citation type: mentioning
confidence: 99%
“…Specifically, in the document classification task, topics are used as features of documents with values P(t | d). These features are used for training a classifier [7,16,17]. In the document clustering task, each topic is considered a cluster and each document is assigned to its most probable topic [16,18]. For the analyses in Section 7, following common practice (e.g., [16,19,20]), we use Purity and Normalized Mutual Information in the clustering task, and Accuracy as our prime evaluation metric in the classification task.…”
Section: Evaluating Topic Models
Citation type: mentioning
confidence: 99%
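The evaluation protocol in this excerpt (most-probable-topic clustering scored with Purity and NMI, plus classification accuracy over P(t | d) features) can be sketched as follows. This is an illustrative implementation, assuming integer gold labels and a scikit-learn logistic regression in place of whichever classifier the cited works actually use:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, normalized_mutual_info_score

def purity(labels_true, labels_pred):
    """Assign each cluster its majority gold label; report the
    fraction of documents covered by those labels."""
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()
    return total / len(labels_true)

def evaluate_topics(theta, labels, train_idx, test_idx):
    """theta  : (N, T) per-document topic proportions P(t | d)
    labels : (N,) integer gold classes (numpy array)."""
    # Clustering: each document is assigned to its most probable topic.
    clusters = theta.argmax(axis=1)
    pur = purity(labels, clusters)
    nmi = normalized_mutual_info_score(labels, clusters)

    # Classification: topic proportions serve as document features.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(theta[train_idx], labels[train_idx])
    acc = accuracy_score(labels[test_idx], clf.predict(theta[test_idx]))
    return {"purity": pur, "nmi": nmi, "accuracy": acc}
```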