Data linkage refers to the process of identifying and linking records that refer to the same entity across multiple heterogeneous data sources. This method has been widely utilized across scientific domains, including public health where records from clinical, administrative, and other surveillance databases are aggregated and used for research, decision making, and assessment of public policies. When a common set of unique identifiers does not exist across sources, probabilistic linkage approaches are used to link records using a combination of attributes. These methods require a careful choice of comparison attributes as well as similarity metrics and cutoff values to decide if a given pair of records matches or not and for assessing the accuracy of the results. In large, complex datasets, linking and assessing accuracy can be challenging due to the volume and complexity of the data, the absence of a gold standard, and the challenges associated with manually reviewing a very large number of record matches. In this paper, we present AtyImo, a hybrid probabilistic linkage tool optimized for high accuracy and scalability in massive data sets. We describe the implementation details around anonymization, blocking, deterministic and probabilistic linkage, and accuracy assessment. We present results from linking a large population-based cohort of 114 million individuals in Brazil to public health and administrative databases for research. In controlled and real scenarios, we observed high accuracy of results: 93%-97% true matches. In terms of scalability, we present AtyImo's ability to link the entire cohort in less than nine days using Spark and scaling up to 20 million records in less than 12s over heterogeneous (CPU+GPU) architectures.
Health technology assessment (HTA) is the systematic evaluation of the properties and impacts of health technologies and interventions. In this article, we presented a discussion of HTA and its evolution in Brazil, as well as a description of secondary data sources available in Brazil with potential applications to generate evidence for HTA and policy decisions. Furthermore, we highlighted record linkage, ongoing record linkage initiatives in Brazil, and the main linkage tools developed and/or used in Brazilian data. Finally, we discussed the challenges and opportunities of using secondary data for research in the Brazilian context. In conclusion, we emphasized the availability of high quality data and an open, modern attitude toward the use of data for research and policy. This is supported by a rigorous but enabling legal framework that will allow the conduct of large-scale observational studies to evaluate clinical, economical, and social impacts of health technologies and social policies.
The integration of disparate large and heterogeneous socioeconomic and clinical databases is now considered essential to capture and model longitudinal and social aspects of diseases. However, such integration has significant challenges associated with it. Databases are often stored in disparate locations, make use of different identifiers, have variable data quality, record information in bespoke purpose-specific formats and have different levels of associated metadata. Novel computational methods are required to integrate such databases and enable their statistical analyses for clinical research purposes. In this paper, we describe a probabilistic approach for constructing a very large population-based cohort comprised of 114 million individuals using linkages between clinical databases from the National Health System and other administrative databases from various government entities in order to facilitate epidemiological research. We discuss and evaluate the design and validation of our data integration model and probabilistic data linkage methods for creating research data marts that can be statistically analyzed.
Background and aimsA cooperation Brazil-UK was set in mid-2013 aiming at to build a huge cohort comprised by individuals registered in CadastroÚnico (CADU), a socioeconomic database used in social programmes of the Brazilian government. Epidemiologists and statisticians wish to assess the impact of Bolsa Família (PBF), a conditional cash transfer programme, on the incidence of several diseases (tuberculosis, leprosy, HIV etc). The cohort must contain all individuals who received at least one payment from PBF between 2007 and 2012, which results in a 100-million records according to our preliminary analysis. These individuals must be probabilistically linked with databases from the Unified Health System (SUS), such as hospitalization (SIH), notifiable diseases (SINAN), mortality (SIM), live births (SINASC), to produce data marts (domain-specific data) to the proposed studies. Within this cooperation, our first goal was to design and evaluate probabilistic methods to routine link the cohort, PBF, and SUS outcomes. ApproachWe implemented two probabilistic linkage methods: a full probabilistic, based on the Dice similarity (Sorensen index) of Bloom filters; and an hybrid approach, based on rules to deterministic and probabilistic matching. We performed linkages involving CADU (2011 extraction) and SUS outcomes (SIH, SINAN, and SIM) with samples from 3 states (Sergipe, Santa Catarina and Bahia) with an increasing size (from 1,447,512 to 12,036,010). ResultsUsing a Dice between 0.90 and 0.92, our methods retrieved more than 95% of true positive pairs amongst the linked pairs. For Sergipe, we obtained as <linked pairs,true positives>: <23,22>, <315,300>, <32,32>, respectively for SIH, SINAN, and SIM. For Bahia: <771,593>, <2677,2626>, <208,207>. Another linkage between CADU (1,447,512 records) and SINAN (624 records), for tuberculosis in Sergipe, returned 397 (full probabilistic) and 311 (hybrid) linked pairs, being 306 and 300 true positives. Another execution considering CADU (1,988,599 records) and SINAN (2,094 records), for tuberculosis in Santa Catarina, returned 791 (full probabilistic) and 500 (hybrid) linked pairs, with 667 and 472 true positives. Linking CADU (1.685,697 records) and SIM, for mortality of children under-4, returned 18 linked pairs, all of them true positives, for a Dice between 0.90 and 0.92 and with 100% of sensitivity, specificity, and positive predictive value. ConclusionDue to the absence of gold standards, we use samples with increasing sizes and manual review when adequate. Our results are quite accurate, although obtained with an unique extraction of CADU. We are starting to run linkages with the entire cohort.
Objective: Multimorbidity, or the occurrence of two or more chronic conditions, is a global challenge, with implications for mortality, morbidity, disability, and life quality. Psychiatric disorders are common among the chronic diseases that affect patients with multimorbidity. It is still not well understood whether psychiatric symptoms, especially depressive symptoms, moderate the effect of multimorbidity on cognition. Methods: We used a large (n=2,681) dataset to assess whether depressive symptomatology moderates the effect of multimorbidity on cognition using structural equation modelling. Results: It was found that the more depressive symptoms and chronic conditions, the worse the cognitive performance, and the higher the educational level, the better the cognitive performance. We found a significant but weak (0.009; p = 0.04) moderating effect. Conclusion: We have provided the first estimate of the moderating effect of depression on the relation between multimorbidity and cognition, which was small. Although this moderation has been implied by many previous studies, it was never previously estimated.
ABSTRACT Background and aimsThe Brazilian government has several social protection programmes that select their beneficiaries based on socioeconomic information kept in the CadastroÚnico (CADU) database. The CADU will be used to build a population-based cohort of approximately 100 million individuals. Among the social programmes is the Bolsa Família (PBF), a conditional cash transfer programme that provides extra income to poor families. These two databases must be deterministically linked to individuals who have received payments from PBF between 2004 and 2012. It will be used in epidemiological studies aiming to assess the impact of PBF on the occurrence and severity of several diseases and health problems (tuberculosis, leprosy, HIV, child health etc). This cohort must be probabilistically linked with databases from the Unified Health System (SUS), such as hospitalization, notifiable diseases, mortality, and live births, in order to produce data marts (domain-specific data) to the proposed studies. Our goals comprise the validation of probabilistic record linkage methods to support this cohort setup. ApproachThis paper emphasizes the accuracy assessment of our methods based on the linkage of SIH (hospitalization), SINAN (notifications), and SIM (mortality) records to the 2011 extraction of CADU. We focused on hospitalization and notification of tuberculosis, as well infant mortality for all causes in under-4 children, for a small sample with 30,029 records (CADU). Due to the absence of gold standards, we used two approaches to assess accuracy: a clerical review and an automatic (tool-based) search. In the first case, we used different cut-off points as similarity index to calculate sensitivity and specificity, and a ROC curve to separate matched and non-matched pairs. The second approach retrieves from CADU all matched and non-matched pairs for a given individual, serving as a gold standard for validation. ResultsWe retrieved 22 linked pairs, from which 18 are true positives for infant mortality (SIM database). From SINAN, our results were 434 linked pairs with 166 true positives, and with SIH, 121 linked pairs with 34 true positives. The sensitivity of manual scan for SIM (children mortality) ranges from 44% (specificity of 100%) to 95% (specificity of 94%), with similarity indices between 0.80 and 0.97, respectively. For automatic search, we obtained a sensitivity of 69.2% and specificity of 91.8%. ConclusionOur results show the need for a continuous improvement in our linkage routines and how to consistently evaluate their accuracy in the absence of adequate gold standards.
The Brazilian Early Childhood Friendly Municipal Index (IMAPI) is a population‐based approach to monitor the nurturing care environment for early childhood development (ECD) using routine information system data. It is unknown whether IMAPI can be applied to document metropolitan urban territorial differences in nurturing care environments. We used Brasilia, Brazil's capital with a large metropolitan population of 2,881,854 inhabitants divided into 31 districts, as a case study to examine whether disaggregation of nurturing care data can inform a more equitable prioritization for ECD in metropolitan areas. IMAPI scores were estimated at the municipal level (IMAPI‐M, 31 indicators) and at the district level (IMAPI‐D, 29 indicators). We developed a quantitative prioritization process for indicators in each IMAPI analysis, and those selected were jointly mapped in the socioecological model for the role of indicators in relation to the enabling environment for nurturing care. Out of 28 common nurturing care indicators across IMAPI analysis, only four were prioritized in both analyses: one from the Adequate nutrition, two from the Opportunities for early learning, and one from the Responsive caregiving domains. These four indicators were mapped as enabling policies, supportive services, and caregivers’ capabilities (socioecological model) and Effort, Coverage, and Quality (indicator's role). In conclusion, the different levels of nurturing care data disaggregation in the IMAPI can better inform decision‐making than each one individually, especially in metropolitan areas where municipalities and districts within metropolitan areas have relative decision‐making autonomy.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.