2014
DOI: 10.1371/journal.pone.0087555

Redundancy-Aware Topic Modeling for Patient Record Notes

Abstract: The clinical notes in a given patient record contain much redundancy, in large part due to clinicians’ documentation habit of copying from previous notes in the record and pasting into a new note. Previous work has shown that this redundancy has a negative impact on the quality of text mining and topic modeling in particular. In this paper we describe a novel variant of Latent Dirichlet Allocation (LDA) topic modeling, Red-LDA, which takes into account the inherent redundancy of patient records when modeling c…
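The abstract's premise is that copy-pasted note text inflates term statistics, which in turn distorts topic learning. A toy sketch (plain Python, not the paper's Red-LDA method; the note texts are hypothetical) illustrates how a pasted-forward note double-counts terms unless duplicated text is stripped first:

```python
from collections import Counter

def term_counts(notes):
    """Aggregate word counts across a patient's notes."""
    counts = Counter()
    for note in notes:
        counts.update(note.lower().split())
    return counts

# Hypothetical record: the second note copies the first and appends one line.
note1 = "patient stable on lisinopril follow up in two weeks"
note2 = note1 + " new complaint of mild cough"

raw = term_counts([note1, note2])
# Crude redundancy handling: strip the copied text before counting.
deduped = term_counts({note1, note2.replace(note1, "").strip()})

print(raw["lisinopril"])      # counted twice because of the copy-paste
print(deduped["lisinopril"])  # counted once after deduplication
```

Red-LDA itself folds redundancy awareness into the model rather than preprocessing the text this way; the sketch only shows why uncorrected counts are biased.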

Cited by 55 publications (35 citation statements)
References 16 publications
“…These models have been largely discussed for general corpora (e.g., newspaper articles), and have been developed for many uses, including word-sense disambiguation [13], topic correlation [14], learning information hierarchies [15], and tracking themes over time [16, 17]. In the biomedical domain, work has investigated the use of topic models to evaluate the impact of copy-and-pasted text on topic learning [18], better understanding and predicting Medical Subject Headings (MeSH) applied to PubMed articles [19], and exploring the correlation between Food and Drug Administration (FDA) research priorities and topics in research articles funded under those priorities [20]. Recently, topic models have been employed in the clinical domain in problems such as case-based retrieval [21]; characterizing clinical concepts over time [22]; and predicting patient satisfaction [23], depression [24], infection [25], and mortality [26].…”
Section: Introduction
confidence: 99%
“…A topic may then be sampled from the topic multinomial, which indexes individual topics from which words are drawn to generate documents. The inclusion of a Dirichlet prior has the benefit of mitigating overfitting, which is a limitation of PLSI [18]. …”
Section: Introduction
confidence: 99%
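The quoted passage summarizes LDA's generative story: per-document topic proportions are drawn from a Dirichlet prior, a topic is sampled from that multinomial for each word position, and the word is then drawn from the chosen topic's word distribution. A minimal NumPy sketch of this process (the dimensions and hyperparameters are illustrative, not tied to any cited implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, doc_len = 3, 8, 20            # topics, vocabulary size, words per document
alpha = np.full(K, 0.5)             # Dirichlet prior over topic proportions
# Topic-word distributions; each row is a multinomial over the vocabulary.
beta = rng.dirichlet(np.full(V, 0.1), size=K)

theta = rng.dirichlet(alpha)        # per-document topic multinomial
doc = []
for _ in range(doc_len):
    z = rng.choice(K, p=theta)      # sample a topic from the topic multinomial
    w = rng.choice(V, p=beta[z])    # draw a word from that topic's distribution
    doc.append(w)

print(doc)  # a synthetic document as a list of word indices in [0, V)
```

The Dirichlet prior on theta is what distinguishes LDA from PLSI: it regularizes the per-document topic proportions, which is the overfitting mitigation the citation refers to.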
“…[7,8] In the clinical domain, work has investigated the use of topic models in case-based retrieval,[1] characterizing clinical concepts over time,[9] and the impact of copy-and-pasted text on topic learning. [10] Topics have also been used as features in classifiers in order to predict patient satisfaction,[11] depression,[12] infection,[13] and mortality. [14]…”
Section: Introduction
confidence: 99%
“…Even though the goal of anonymization proposals is not to analyze or prioritize EMRs, the NLP techniques applied to recognize identifiers and quasi-identifiers are very close to the process of identifying entities for other purposes. Other existing research for EMR text pre-processing has demonstrated the possibility of extracting temporal expressions [32], [33]; correcting misspelled words [34]; resolving existing coreferences [35]; eliminating redundancy [36] and generating summaries [37].…”
Section: Background and Related Work
confidence: 99%