Introduction: In Brazil, the National Health System (SUS) provides healthcare to the public. The system has multiple administrative databases; the major ones record hospital (SIH) and outpatient (SIA) procedures. Epidemiological information on the whole population is collected in subsystems such as mortality (SIM), live births (SINASC) and notifiable diseases (SINAN). Each subsystem has its own information system, which can provide information about consultations, clinical data and medicines dispensed. However, these systems are not linked, which prevents individual-centred analysis.
Objective: To describe the methods and the results of the parameter setting needed to execute the probabilistic deduplication of large administrative and epidemiological databases in Brazil and to create a National Health Database centred on the individual.
Methods: This paper shows the results of a record linkage model that integrates data from SIH, SIA, SIM and SINAN, whose formats and attributes differ between systems and over time. These data consist of 1.3 billion records from 2000-2015. Probabilistic and deterministic record linkage were used to deduplicate the data. The Kappa statistic and clerical review were used to ensure the quality of the linkage. A graph algorithm with depth-first search was used to generate the identifiers.
Results: The deterministic deduplication process resulted in a database with 403,113,527 possible unique individuals. After probabilistic deduplication of that database, 159,703,805 unique individuals were identified. The estimated false-positive error rate of this result was 3.3%, and the estimated false-negative error rate was 12.3%.
Conclusions: The National Health Database centred on the individual was generated and will allow researchers to use real-world evidence to conduct clinical, epidemiological, economic and other studies.
This database represents a significant cohort, spanning 15 years of historical data while preserving patient privacy. The success of the process described here will allow it to be repeated and data from future years to be appended, enabling important studies that promote SUS efficiency and provide better treatments for patients.
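The identifier-generation step mentioned in the Methods (a graph algorithm with depth-first search) can be illustrated with a minimal sketch. It assumes that each record is a graph node and each accepted link between two records is an edge, so that every connected component corresponds to one individual. The function name `assign_person_ids` and the record labels are hypothetical; the abstract does not describe the actual implementation.

```python
from collections import defaultdict


def assign_person_ids(record_ids, linked_pairs):
    """Give every record in the same connected component the same identifier.

    Hypothetical helper: records are nodes, accepted links are edges.
    """
    # Build an undirected adjacency list from the accepted links.
    neighbours = defaultdict(set)
    for a, b in linked_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    person_id = {}
    next_id = 0
    for start in record_ids:
        if start in person_id:
            continue  # already reached from an earlier record
        # Iterative depth-first search, which avoids Python's recursion
        # limit when a cluster of linked records is large.
        stack = [start]
        person_id[start] = next_id
        while stack:
            node = stack.pop()
            for other in neighbours[node]:
                if other not in person_id:
                    person_id[other] = next_id
                    stack.append(other)
        next_id += 1
    return person_id


# Illustrative data only: five records, three accepted links.
records = ["r1", "r2", "r3", "r4", "r5"]
links = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
ids = assign_person_ids(records, links)
print(ids)
```

In this toy input, r1 and r3 are never compared directly, yet they receive the same identifier because both are linked to r2. This transitive closure is the reason for treating the links as a graph rather than as isolated pairs.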
Methods: A semantic analysis of the data was performed to describe and understand the different meanings of the fields in the databases studied. Four main procedures were then executed with database operation tools and the PL/SQL programming language: cleaning and standardization of the databases (document numbers were checked against the Brazilian national people's database, using an approximate string-matching algorithm to decide whether the document number belonged to the record); extraction of registration information; deterministic deduplication; and probabilistic deduplication. The procedures were first performed on each database separately; after the records were unified, deterministic deduplication was run again. The exception was probabilistic deduplication, which was performed only on the final, deterministically deduplicated database. These procedures informed the choice of fields for the data model of the unified database. Nine representative patient-related fields were selected: patient's name; patient's mother's name; sex; birth date; state; city; zip code; CPF and CNS (Brazilian documents).
Results: The unified registration database initially contained 705,599,785 records; deterministic deduplication reduced it to 198,400,762 records. This reduction occurs because the source databases are not fully integrated. Moreover, the systems' semantics do not always agree, and in some cases the data format changes over the period within the same system. After probabilistic deduplication, the number of unique records decreased to 124,545,186, a decrease explained by pairs that the deterministic process had not linked. The estimated error of this result is at most 3.3% false-positive pairs and at most 12.3% false-negative pairs.
Conclusion: The results show that data deduplication is necessary and should be carried out thoroughly.
Where the databases held limited patient registration information, the technique made it possible to capture additional information and build a more complete record. Furthermore, it helped to identify and understand positive and negative aspects of the systems and to trace patients' clinical condition, enabling pharmacoeconomic and epidemiological studies that determine the effectiveness and efficiency of public policies and embedded technologies. As future work, it is important to ensure the uniqueness of records and to link this database with data from earlier periods.
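The two deduplication stages described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' PL/SQL procedure: the deterministic key (normalised name, mother's name and birth date), the use of Python's `difflib.SequenceMatcher` as the approximate string comparison, and the 0.9 threshold are all hypothetical choices. A plain similarity threshold stands in here for the probabilistic model, whose parameters the abstract does not give.

```python
import unicodedata
from difflib import SequenceMatcher


def normalise(text):
    """Upper-case, strip accents and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.upper().split())


def deterministic_key(record):
    """Exact-match key (assumed): normalised names plus birth date."""
    return (
        normalise(record["name"]),
        normalise(record["mother_name"]),
        record["birth_date"],
    )


def deduplicate_exact(records):
    """Keep the first record seen for each deterministic key."""
    seen = {}
    for record in records:
        seen.setdefault(deterministic_key(record), record)
    return list(seen.values())


def similarity(a, b):
    """Similarity in [0, 1] between two normalised strings."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def is_probable_match(r1, r2, threshold=0.9):
    """Simplified stand-in for probabilistic linkage (threshold is assumed)."""
    if r1["birth_date"] != r2["birth_date"]:
        return False
    name_score = similarity(r1["name"], r2["name"])
    mother_score = similarity(r1["mother_name"], r2["mother_name"])
    return (name_score + mother_score) / 2 >= threshold


# Illustrative records only; not taken from any real database.
records = [
    {"name": "José da Silva", "mother_name": "Maria da Silva", "birth_date": "1980-05-01"},
    {"name": "JOSE DA SILVA", "mother_name": "MARIA  DA SILVA", "birth_date": "1980-05-01"},
    {"name": "Jose da Sylva", "mother_name": "Maria da Silva", "birth_date": "1980-05-01"},
]
unique = deduplicate_exact(records)
fuzzy_match = is_probable_match(unique[0], unique[1])
print(len(unique), fuzzy_match)
```

The first two records collapse in the deterministic stage because they differ only in accents, case and spacing. The third survives that stage and is caught only by the approximate comparison, which mirrors the further reduction the Results attribute to pairs not linked by the deterministic process.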