Exploring hybrid parallel systems for probabilistic record linkage

Boratto, Murilo; Alonso, Pedro; Pinto, Clicia; Melo, Pedro O. S. Vaz de; Barreto, Marcos; Denaxas, Spiros

doi:10.1007/s11227-018-2328-3

Cited by 6 publications

(4 citation statements)

References 14 publications

(13 reference statements)

“…To circumvent scalability challenges over big data sets, different approaches have been used in the literature, such as parallelism/distribution and blocking (or indexing) strategies, as well as their combinations (Christen, 2008; Pita et al, 2018). Other initiatives have also proposed the use of cluster-based platforms, multi-processors or graphics processing units (GPUs) (Boratto et al, 2018; Pita et al, 2018). Blocking and indexing step generates pairs of candidate records pertaining to the same comparison blocks (Christen, 2012).…”

Section: Data Linkage (mentioning)

confidence: 99%

“…AtyImo, in comparison to previous linkage tools freely available, has reasonably better accuracy and shorter execution time with a major advantage to scale upward to huge databases (Pita et al, 2018). The current version of AtyImo, based on NVIDIA’s CUDA library, is able to probabilistically link databases of up to 80 million records in around 60 s over multiple GPU architectures (Boratto et al, 2018).…”

Section: Record Linkage Tools Developed And/or Used In Brazil (mentioning)

confidence: 99%

Administrative Data Linkage in Brazil: Potentials for Health Technology Assessment

et al. 2019

Health technology assessment (HTA) is the systematic evaluation of the properties and impacts of health technologies and interventions. In this article, we presented a discussion of HTA and its evolution in Brazil, as well as a description of secondary data sources available in Brazil with potential applications to generate evidence for HTA and policy decisions. Furthermore, we highlighted record linkage, ongoing record linkage initiatives in Brazil, and the main linkage tools developed and/or used in Brazilian data. Finally, we discussed the challenges and opportunities of using secondary data for research in the Brazilian context. In conclusion, we emphasized the availability of high quality data and an open, modern attitude toward the use of data for research and policy. This is supported by a rigorous but enabling legal framework that will allow the conduct of large-scale observational studies to evaluate clinical, economical, and social impacts of health technologies and social policies.

“…Bloom filters [30,31], which transform bigrams from the linkage key attributes into a binary vector, are used for similarity calculation (matching). AtyImo has proven to be quite effective, providing 93% to 97% of accuracy (true positive rate) depending on the databases being linked [32]. CIDACS-RL, another linkage tool designed over Apache Lucene, uses a novel approach based on an indexing search and sorting algorithm to perform information retrieval.…”

Section: Data Linkage (mentioning)

confidence: 99%
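
The Bloom-filter matching described in the statement above can be illustrated with a minimal sketch. It is not AtyImo's implementation: the vector length, the number of hash functions, the SHA-256 based hashing, the underscore padding, and the Dice coefficient as similarity measure are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import hashlib


def bigrams(value: str) -> set[str]:
    """Split a string into overlapping character bigrams (padded with '_')."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_filter(value: str, size: int = 128, num_hashes: int = 2) -> list[int]:
    """Encode the bigrams of `value` into a binary vector of length `size`.

    `size` and `num_hashes` are illustrative defaults, not values from the source.
    """
    bits = [0] * size
    for gram in bigrams(value):
        for seed in range(num_hashes):
            # Derive one bit position per (seed, bigram) pair from SHA-256.
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bits[int.from_bytes(digest[:8], "big") % size] = 1
    return bits


def dice_similarity(a: list[int], b: list[int]) -> float:
    """Dice coefficient of two bit vectors: 2 * |a AND b| / (|a| + |b|)."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2.0 * common / total if total else 0.0


if __name__ == "__main__":
    reference = bloom_filter("maria silva")
    print(dice_similarity(reference, bloom_filter("maria silvia")))
    print(dice_similarity(reference, bloom_filter("joao pereira")))
```

Two records would then be treated as a candidate match when the similarity of their encoded linkage keys exceeds a chosen threshold; the threshold itself depends on the databases being linked and is not given in the statement.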

The Center for Data and Knowledge Integration for Health (CIDACS)

Barreto

Ichihara

Almeida

et al. 2019

IJPDS

Self Cite

The Center for Data and Knowledge Integration for Health (CIDACS) was created in 2016 in Salvador (Bahia, Brazil). This paper aims to present a profile of CIDACS, including its current databases. CIDACS aims to conduct interdisciplinary studies and research, develop new scientific methodology and promote professional training using linked large-scale databases and high-performance computational resources in a secure environment. Administrative data is at the core of the activities conducted by CIDACS. The advantages of administrative data include significantly larger sample sizes, an inherent longitudinal structure and high-quality information. The center’s research projects are primarily focused on enhancing the understanding surrounding the impact of social protection policies (e.g., public cash-transfer and housing programs) on health outcomes in low-income populations throughout Brazil. CIDACS’ primary data source is citizens who register with the Cadastro Único program, which encompasses individuals eligible to receive benefits from over 20 governmental social programs. CIDACS has two separate environments for data handling: 1) Data Production Center, a secure room housing the computational infrastructure for ingesting, storing, cleaning, processing and linking original databases, as well as extracting research-ready datasets and 2) Data Analysis Environment, a computational infrastructure based on data safe haven principles, which allows researchers to access and process requested datasets. Brazil has a large public health community that uses national health and social databases for research programs, and the linkage of different databases has been a widely employed practice in the country. CIDACS is the result of efforts by researchers, policymakers and public health officials to use and improve the quality of Brazilian health databases. 
CIDACS is expected to be an important resource for researchers and policymakers interested in improving the evidence base in different aspects of health, as well as with regard to the social determinants of health and the effects of social and environmental policies on health in general.

“…Despite its potential for significant improvements in runtime performance, there has not been any further work published on P4Join using larger data sets or on clusters of GPU nodes. More recently, Boratto et al [25] evaluated a hybrid algorithm using both GPUs and central processing units (CPUs) with much larger data sets. Although restricted to single (highly specified) machines, these evaluations show promise provided that the approach can be applied within a compute cluster.…”

Section: Introduction (mentioning)

confidence: 99%

Secure Record Linkage of Large Health Data Sets: Evaluation of a Hybrid Cloud Model

Brown

Randall

2020

JMIR Med Inform

Background: The linking of administrative data across agencies provides the capability to investigate many health and social issues with the potential to deliver significant public benefit. Despite its advantages, the use of cloud computing resources for linkage purposes is scarce, with the storage of identifiable information on cloud infrastructure assessed as high risk by data custodians. Objective: This study aims to present a model for record linkage that utilizes cloud computing capabilities while assuring custodians that identifiable data sets remain secure and local. Methods: A new hybrid cloud model was developed, including privacy-preserving record linkage techniques and container-based batch processing. An evaluation of this model was conducted with a prototype implementation using large synthetic data sets representative of administrative health data. Results: The cloud model kept identifiers on premises and used privacy-preserved identifiers to run all linkage computations on cloud infrastructure. Our prototype used a managed container cluster in Amazon Web Services to distribute the computation using existing linkage software. Although the cost of computation was relatively low, the use of existing software resulted in an overhead of processing of 35.7% (149/417 min execution time). Conclusions: The result of our experimental evaluation shows the operational feasibility of such a model and the exciting opportunities for advancing the analysis of linkage outputs.
