Exploring hybrid parallel systems for probabilistic record linkage

Authors
  • Boratto, Murilo (1)
  • Alonso, Pedro (2)
  • Pinto, Clicia (3)
  • Melo, Pedro (3)
  • Barreto, Marcos (3)
  • Denaxas, Spiros (4)
  • (1) Universidade do Estado da Bahia, Núcleo de Arquitetura de Computadores e Sistemas Operacionais, Salvador, Bahia, Brazil
  • (2) Universitat Politècnica de València, Department of Information Systems and Computation, Valencia, Spain
  • (3) Universidade Federal da Bahia, Laboratório de Sistemas Distribuídos, Salvador, Bahia, Brazil
  • (4) University College London, Institute of Health Informatics Research, School of Computer Science and Informatics, London, UK
Type
Published Article
Journal
The Journal of Supercomputing
Publisher
Springer US
Publication Date
Mar 21, 2018
Volume
75
Issue
3
Pages
1137–1149
Identifiers
DOI: 10.1007/s11227-018-2328-3
Source
Springer Nature

Abstract

Record linkage is a technique widely used to gather data stored in disparate data sources that presumably pertain to the same real-world entity. This integration can be done deterministically or probabilistically, depending on whether common key attributes exist across all the data sources involved. The probabilistic approach is very time-consuming because of the number of records that must be compared, especially in big data scenarios. In this paper, we propose and evaluate a methodology that simultaneously exploits multicore and multi-GPU architectures to perform the probabilistic linkage of large-scale Brazilian governmental databases. We present algorithmic optimizations that provide high accuracy and improve performance by selecting the best algorithm-architecture combination for a problem given its input size. We also discuss performance results obtained with different data samples, showing that a hybrid approach outperforms other configurations, providing an average speedup of 7.9 when linking up to 20.000 million records.
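
To illustrate why the probabilistic approach is costly, the following is a minimal sequential sketch in Python, not the method evaluated in the paper. It assumes a Dice coefficient over character bigrams as the similarity measure and a fixed match threshold of 0.8; the helper names (`bigrams`, `dice`, `link`) and the sample records are hypothetical. Every record on one side is compared with every record on the other, so the number of comparisons grows with the product of the two input sizes, which is the workload that a multicore or multi-GPU implementation would split across processing units.

```python
from itertools import product


def bigrams(text):
    """Return the set of character bigrams of a lowercased, trimmed string."""
    s = text.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient between the bigram sets of two strings (0.0 to 1.0)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def link(left, right, threshold=0.8):
    """Compare every record pair and keep those scoring at or above threshold.

    The threshold value is an assumption chosen for illustration only.
    Returns the matched (left_index, right_index, score) triples and the
    total number of pairwise comparisons performed.
    """
    comparisons = 0
    matches = []
    for (i, a), (j, b) in product(enumerate(left), enumerate(right)):
        comparisons += 1
        score = dice(a, b)
        if score >= threshold:
            matches.append((i, j, round(score, 3)))
    return matches, comparisons


# Hypothetical sample records with small spelling differences.
left = ["maria silva santos", "joao pereira", "ana costa"]
right = ["maria silva santo", "joao pereyra", "carlos lima"]

matches, comparisons = link(left, right)
print(matches)
print(comparisons)  # 3 x 3 inputs give 9 pairwise comparisons
```

Because each pair is scored independently of all others, the loop body can be distributed across workers without coordination, which is what makes this kind of workload a natural candidate for hybrid CPU and GPU execution.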
