2014
DOI: 10.1007/s10618-014-0378-6

Probabilistic change detection and visualization methods for the assessment of temporal stability in biomedical data quality

Abstract: Knowledge discovery on biomedical data can be based on on-line data-stream analyses or on retrospective, timestamped, off-line datasets. In both cases, changes over time in the processes that generate data or in their quality features may hinder either the knowledge discovery process or the generalization of past knowledge. These problems can be seen as a lack of temporal stability in the data. This work establishes temporal stability as a data quality dimension and proposes new methods for its assessment…
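The core idea in the abstract, detecting change by comparing the probability distributions of a variable across periods of time, can be sketched briefly. A minimal Python sketch, assuming a numeric variable batched by month and using the Jensen-Shannon distance that the citing papers below attribute to these methods; the histogram binning and the flagging threshold are illustrative choices, not values from the paper:

    # A minimal sketch, assuming a numeric variable batched by month.
    # The bin count and the 0.2 flagging threshold are illustrative
    # choices, not values from the paper.
    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def temporal_jsd(values, batch_ids, bins=20):
        """Jensen-Shannon distance of each temporal batch vs. the pooled data."""
        edges = np.histogram_bin_edges(values, bins=bins)
        ref, _ = np.histogram(values, bins=edges)
        ref = ref / ref.sum()
        out = {}
        for b in np.unique(batch_ids):
            counts, _ = np.histogram(values[batch_ids == b], bins=edges)
            out[b] = jensenshannon(counts / counts.sum(), ref, base=2)
        return out

    # Synthetic data: the variable's distribution shifts in the last three months.
    rng = np.random.default_rng(0)
    months = np.repeat(np.arange(12), 500)
    x = rng.normal(loc=np.where(months < 9, 0.0, 0.8), scale=1.0)
    for month, d in sorted(temporal_jsd(x, months).items()):
        flag = "  <- possible change" if d > 0.2 else ""  # illustrative threshold
        print(f"month {month:2d}: JSD = {d:.3f}{flag}")

Each batch's empirical distribution is compared against the pooled reference; a sustained rise in the distance suggests a dataset shift worth inspecting.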

Cited by 24 publications (36 citation statements)
References 45 publications (46 reference statements)
“…The methods used in the present study fall into two groups, namely those for assessing multisource variability [18] and those for assessing temporal variability [19]. The methods are based on the comparison of probability distributions of the variables among different sources or over different periods of time. The comparisons are made by calculating the information-theoretic probabilistic distances between pairs of distributions; in concrete terms, we use the Jensen-Shannon distance (JSD), a symmetrized and smoothed version of the Kullback-Leibler divergence.…”
Section: Methods (mentioning)
confidence: 99%
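The Jensen-Shannon distance quoted here can be written out directly from its definition as a symmetrized and smoothed Kullback-Leibler divergence. A small sketch, where the two toy distributions are placeholders for the same variable observed at two sources or two periods, checked against SciPy's implementation:

    # The Jensen-Shannon distance, written out from its definition and
    # checked against SciPy. The toy distributions p and q stand in for the
    # same variable observed at two sources or two periods of time.
    import numpy as np
    from scipy.spatial.distance import jensenshannon
    from scipy.stats import entropy  # entropy(p, q) is the KL divergence

    def js_distance(p, q, base=2):
        """sqrt(0.5*KL(p||m) + 0.5*KL(q||m)) with m = (p + q) / 2."""
        p = np.asarray(p, dtype=float) / np.sum(p)
        q = np.asarray(q, dtype=float) / np.sum(q)
        m = 0.5 * (p + q)
        return np.sqrt(0.5 * entropy(p, m, base=base) + 0.5 * entropy(q, m, base=base))

    p = [0.1, 0.4, 0.5]  # distribution of a variable at source A
    q = [0.3, 0.3, 0.4]  # the same variable at source B
    print(js_distance(p, q))            # from the definition above
    print(jensenshannon(p, q, base=2))  # SciPy gives the same value

With base 2, the distance is bounded in [0, 1], which makes values comparable across variables.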
“…Multi-source or temporal variability, if unmanaged, may lead to inaccurate or irreproducible results [3,18,19] or even to invalid results [11]. The reuse of data in multi-site repositories for population studies, clinical trials, or data mining rests on the assumption that the data distributions are to some degree concordant irrespective of the source of the data or of the time over which the data have been collected, and therefore allow generalizable conclusions to be drawn from the data.…”
Section: Background and Significance (mentioning)
confidence: 99%
“…framework can easily be extended to allow for a more formal statistical process control (see Sáez et al. (2015) and [17,19] for guidance).…” (from a medRxiv preprint, not certified by peer review; version posted September 16, 2019; https://doi.org/10.1101/19006098)
Section: Limitations (mentioning)
confidence: 99%
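The extension this citing preprint points at, a more formal statistical process control over a monitored data-quality metric, can be sketched with conventional Shewhart limits. The baseline window and the 3-sigma rule are standard SPC choices, not specifics from either paper:

    # A minimal Shewhart-style sketch, assuming the monitored series is a
    # per-batch data-quality metric (for instance, the JSD values sketched
    # after the abstract). The baseline window and the 3-sigma rule are
    # conventional SPC choices, not specifics from either paper.
    import numpy as np

    def out_of_control(series, baseline_n):
        """Indices (after the baseline) of points outside mean +/- 3 sigma."""
        s = np.asarray(series, dtype=float)
        base = s[:baseline_n]
        mu, sigma = base.mean(), base.std(ddof=1)
        lo, hi = mu - 3 * sigma, mu + 3 * sigma
        return [i for i in range(baseline_n, len(s)) if not lo <= s[i] <= hi]

    metric = [0.05, 0.06, 0.04, 0.05, 0.07, 0.05, 0.06, 0.05, 0.21, 0.24]
    print(out_of_control(metric, baseline_n=8))  # -> [8, 9]

Points flagged this way would then be inspected for the source or period responsible for the shift.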
“…Larger search spaces, like those encountered for numerical data, (complex) multi-relational datasets (for example, in social networks), or spatiotemporal data, require efficient algorithms that can handle those different types of data, e.g., Refs . Also, combinations of such different data characteristics, for example temporal pattern mining for event detection or temporal subgroup analytics, provide further challenges, especially considering sophisticated exceptional model classes in that area.…”
Section: Future Directions and Challenges (mentioning)
confidence: 99%