Ted Enamorado and Rebecca C. Steorts. 2020. “Probabilistic Blocking and Distributed Bayesian Entity Resolution.” In: Josep Domingo-Ferrer, Krish Muralidhar (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham.
Entity resolution (ER) is becoming an increasingly important task across many domains (e.g., official statistics, human rights, and medicine), where databases contain duplicate records of entities that must be removed before later inferential and prediction tasks. Motivated by scaling to large datasets and providing uncertainty propagation, we propose a generalized approach to the blocking and ER pipeline that consists of two steps. The first is a probabilistic blocking step, for which we consider an existing approach that is an ER method in its own right. Using it for blocking greatly reduces the comparison space and provides overlapping blocks for any ER method in the literature. Second, the output of the probabilistic blocking step is passed to any ER method, where one can evaluate uncertainty propagation depending on the ER task. For this step we consider a joint Bayesian method for both blocking and ER, which provides a joint posterior distribution over both the blocking and ER and scales to large datasets; however, it does so at a slower rate than when used in tandem with the probabilistic blocking step. Through simulation and empirical studies, we show that our proposed methodology outperforms either method used in isolation. It produces reliable estimates of the underlying linkage structure and the number of true entities in each dataset. Furthermore, it produces an approximate posterior distribution and preserves transitive closures of the linkages.
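To make the two-step idea concrete, the following is a minimal toy sketch, not the authors' method: it replaces probabilistic blocking with simple deterministic keys and replaces the Bayesian ER step with a string-similarity threshold. The records, the `block_keys` rule, and the `threshold` value are all hypothetical choices made for illustration. It shows only how overlapping blocks shrink the comparison space and how a union-find structure keeps the resulting linkages transitively closed.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records (id -> name); not data from the paper.
records = {
    0: "john smith",
    1: "jon smith",
    2: "john smyth",
    3: "mary jones",
    4: "marie jones",
    5: "alan turing",
}

def block_keys(name):
    # Assumed blocking rule: two keys per record (first-name initial and
    # surname initial), so a record can fall into overlapping blocks.
    first, last = name.split()
    return {("first", first[0]), ("last", last[0])}

def candidate_pairs(recs):
    # Group record ids by block key, then compare only within blocks.
    blocks = {}
    for rid, name in recs.items():
        for key in block_keys(name):
            blocks.setdefault(key, []).append(rid)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

def resolve(recs, threshold=0.8):
    # Union-find over record ids: merging linked pairs guarantees that
    # the final clusters are transitively closed.
    parent = {rid: rid for rid in recs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in candidate_pairs(recs):
        # Stand-in for a real ER model: link if string similarity is high.
        if SequenceMatcher(None, recs[a], recs[b]).ratio() >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for rid in recs:
        clusters.setdefault(find(rid), set()).add(rid)
    return sorted(sorted(c) for c in clusters.values())

all_pairs = len(records) * (len(records) - 1) // 2
print(f"compared {len(candidate_pairs(records))} of {all_pairs} pairs")
print(resolve(records))
```

On this toy input, blocking leaves 4 of the 15 possible pairs to compare. A real pipeline would replace both the key rule and the threshold with probabilistic models and would report posterior uncertainty rather than a single hard clustering.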