Short Text Document Clustering using Distributed Word Representation and Document Distance

Kongwudhikunakorn, Supavit; Waiyamai, Kitsana

doi:10.48048/wjst.2019.4133

Cited by 2 publications

(2 citation statements)

References 30 publications

“…Similarly, Haj-Yahia et al 38 and Schopf et al 39 semantically matched text to classification labels for unsupervised text classification. Meanwhile, Kongwudhikunakorn et al 40 used word embeddings and the Word Mover’s Distance 41 to accurately cluster documents.…”

Section: Discussion (mentioning)

confidence: 99%

Generalization of finetuned transformer language models to new clinical contexts

Xie, Terman, Gallagher et al. 2023

JAMIA Open

Objective: We have previously developed a natural language processing pipeline using clinical notes written by epilepsy specialists to extract seizure freedom, seizure frequency text, and date of last seizure text for patients with epilepsy. It is important to understand how our methods generalize to new care contexts.

Materials and methods: We evaluated our pipeline on unseen notes from nonepilepsy-specialist neurologists and non-neurologists without any additional algorithm training. We tested the pipeline out-of-institution using epilepsy specialist notes from an outside medical center with only minor preprocessing adaptations. We examined reasons for discrepancies in performance in new contexts by measuring physical and semantic similarities between documents.

Results: Our ability to classify patient seizure freedom decreased by at least 0.12 agreement when moving from epilepsy specialists to nonspecialists or other institutions. On notes from our institution, textual overlap between the extracted outcomes and the gold standard annotations attained from manual chart review decreased by at least 0.11 F1 when an answer existed but did not change when no answer existed; here our models generalized on notes from the outside institution, losing at most 0.02 agreement. We analyzed textual differences and found that syntactic and semantic differences in both clinically relevant sentences and surrounding contexts significantly influenced model performance.

Discussion and conclusion: Model generalization performance decreased on notes from nonspecialists; out-of-institution generalization on epilepsy specialist notes required small changes to preprocessing but was especially good for seizure frequency text and date of last seizure text, opening opportunities for multicenter collaborations using these outcomes.

“…The results of experiments conducted on Turkish tweets by using word embeddings are compared with the results where TF-IDF representations are used. Kongwudhikunakorn and Waiyamai (2020) propose a combination of document representation, document distance measure and a document clustering method in order to improve performance in short text clustering. The method includes (1) distributed representation of words for document representation (Mikolov, Sutskever, et al, 2013;, (2) Word Mover's Distance as the document distance metric (Kusner et al, 2015), and (3) K-means algorithm for document clustering (MacQueen, 1967).…”

Section: Short Text Clustering: Recent Developments For Batch Processing (mentioning)

confidence: 99%

Recent methods on short text stream clustering: A survey study

Maden, Karagöz 2023

WIREs Computational Stats

The volume and the velocity of data in social media are increasing and the social media has become a very useful environment to detect and track the real‐world events. However, to fulfill this, it is crucial to group‐related texts according to their topics and clustering takes an essential role at this point since we have no prior knowledge about the topics and their evolution in social media. In this survey, we review the current approaches and techniques proposed for short text stream clustering in recent years. The reviewed techniques are grouped according to their methodology and discussed in detail. Also, the datasets utilized to evaluate the performance of the proposed methods and the results are summarized together with the clustering quality measures used for these evaluations. Furthermore, current challenges about short‐text stream clustering are discussed.

This article is categorized under: Data: Types and Structure > Streaming Data
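The second citation statement above summarizes the cited method as word embeddings for document representation, Word Mover's Distance as the document distance, and K-means for clustering. The sketch below is only a rough illustration of that idea and not the authors' implementation. It assumes tiny hand-made 2-D word vectors (the `EMBEDDINGS` table is invented), it computes the relaxed lower bound on Word Mover's Distance described by Kusner et al. (2015) rather than the exact transport problem, and it replaces K-means with a single nearest-seed assignment step. All function names are hypothetical.

```python
import math

# Hypothetical 2-D word vectors, invented for illustration only.
# A real setup would load pretrained embeddings instead.
EMBEDDINGS = {
    "seizure": (0.9, 0.1),
    "epilepsy": (0.8, 0.2),
    "patient": (0.7, 0.3),
    "tweet": (0.1, 0.9),
    "hashtag": (0.2, 0.8),
    "post": (0.3, 0.7),
}


def euclidean(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(doc_a, doc_b):
    """Cost when every token of doc_a sends its whole mass to the
    nearest word of doc_b (each token weighs 1 / len(doc_a))."""
    weight = 1.0 / len(doc_a)
    return sum(
        weight * min(euclidean(EMBEDDINGS[w], EMBEDDINGS[x]) for x in doc_b)
        for w in doc_a
    )


def relaxed_wmd(doc_a, doc_b):
    """Relaxed Word Mover's Distance: the larger of the two one-sided
    costs, which is a lower bound on the exact distance."""
    return max(one_sided_cost(doc_a, doc_b), one_sided_cost(doc_b, doc_a))


def assign_to_nearest(docs, seeds):
    """Assign each document to the index of its closest seed document.
    This is a simplified stand-in for the clustering step, not K-means."""
    return [
        min(range(len(seeds)), key=lambda k: relaxed_wmd(doc, seeds[k]))
        for doc in docs
    ]


docs = [
    ["seizure", "patient"],
    ["epilepsy", "patient"],
    ["tweet", "hashtag"],
    ["post", "hashtag"],
]
# Use the first and third documents as the two seeds.
assignments = assign_to_nearest(docs, [docs[0], docs[2]])
```

In this toy setup the documents share few exact words, so the distance relies on the closeness of the word vectors rather than on term overlap, which is the motivation usually given for embedding-based distances on short texts.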
