Scalable Matching and Clustering of Entities with FAMER

Entity resolution identifies semantically equivalent entities, e.g. entities describing the same product or customer. It is especially challenging for Big Data applications where large volumes of data from many sources have to be matched and integrated. We therefore introduce a scalable entity resolution framework called FAMER (FAst Multi-source Entity Resolution system) that is based on Apache Flink for distributed execution and that can holistically match entities from multiple sources. For the latter purpose, FAMER includes multiple clustering schemes that group matching entities from different sources within clusters. In addition to previously known clustering schemes, FAMER includes new approaches tailored to multi-source entity resolution. We perform a detailed comparative evaluation of eight clustering schemes for different real-life and synthetically generated datasets. The evaluation considers both the match quality and the scalability for different numbers of machines and data sizes.


Introduction
Entity resolution (ER), also called deduplication, record linkage or object matching, is the task of identifying records that refer to the same real-world entity, such as specific customers, products or publications. This problem is of key importance for improving data quality and for integrating data from multiple sources. Numerous approaches for entity resolution have been developed and investigated [1], [2]. They typically derive match decisions based on the combined similarity of several attribute values and possibly on the contextual similarity of entities (for instance, two publications may match if they have both highly similar titles and co-authors). To achieve high efficiency for large datasets, one has to avoid comparing each entity to all other entities. This is achieved by so-called blocking strategies [1] where only records within the same block (partition) need to be compared with each other, e.g. only publications from the same year are considered.
Entity resolution can also be performed in parallel on multiple processors and computing nodes to achieve additional performance improvements [3].
Most previous ER approaches compare pairs of entities and determine binary match mappings consisting of all correspondences or links between two matching entities. This is a natural approach when one has to integrate only a few data sources, but it does not scale well since the number of binary mappings grows quadratically with the number of sources. For instance, integrating data from 200 sources would require the determination (and maintenance) of 19,900 mappings, which is not practically feasible with today's ER tools. A better approach for integrating data from multiple data sources is to group all matching entities within clusters, as this allows a more compact match representation than binary links [4]. It also simplifies the fusion of the matching entities for data integration by combining and consolidating the attribute values of the different cluster members. Furthermore, it allows an incremental integration of additional entities and data sources by comparing them with the set of previously determined clusters.
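The quadratic growth mentioned above can be checked with a few lines of Python. This is only an illustrative sketch; the function name is hypothetical and not part of FAMER.

```python
def num_binary_mappings(num_sources: int) -> int:
    """Number of pairwise mappings needed to link every pair of sources."""
    return num_sources * (num_sources - 1) // 2


if __name__ == "__main__":
    for n in (2, 10, 200):
        print(n, "sources ->", num_binary_mappings(n), "binary mappings")
```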
In our research, we aim at scalable ER approaches for Big Data that are able to deal with large data volumes and multiple data sources. We have therefore developed a new framework called FAMER (FAst Multi-source Entity Resolution system) for multi-source entity resolution that supports clustering matching entities and exploits both blocking and distributed (parallel) processing. It is implemented on top of the distributed dataflow framework Apache Flink to achieve high scalability to large amounts of data and many machines. FAMER includes multiple clustering schemes to group matching entities, and the main goal of this article is to comparatively evaluate the match quality and runtime performance of these schemes. The considered clustering schemes require, as input, a so-called similarity graph containing all links between matching entities; they try to find additional links by considering indirect matches and to eliminate weaker links in favor of more plausible ones. The clustering schemes include previously known clustering schemes (connected components, center clustering, star clustering, merge clustering, correlation clustering) as well as two newly developed approaches for multi-source entity resolution dubbed SplitMerge (introduced in [5], [6]) and CLIP [7]. In total, we perform a detailed comparative evaluation of the match quality and scalability of eight clustering schemes for different real-life and synthetically generated datasets.
This article is an extended version of the conference publication [8]. Compared to [8], we here provide a more detailed discussion of related work and add the description and comparative evaluation of the SplitMerge and CLIP clustering schemes. The CLIP algorithm investigated here is an optimized version of the initial approach of [7] with much better runtimes.
In the next section, we discuss related work on entity resolution and clustering. In Section 3, we provide an overview of our FAMER framework. Section 4 describes the considered clustering algorithms and their distributed implementation. In Section 5, we evaluate the match quality and scalability of the approaches for different datasets. Section 6 summarizes our findings and discusses future work.

Related Work
There is a huge amount of literature about ER, and several books and surveys provide an overview of the main methods and tools, e.g. [1], [2], and [9]. The decision whether two entities match is typically based on the combined similarity of several attribute values and possibly on the contextual similarity of entities. In current systems, the combination of the similarity values for deriving a match decision is either based on supervised classification models (learned from training examples) or on manually determined match rules. To achieve high efficiency for large datasets, one has to avoid comparing each entity to all other entities. This is made possible by utilizing so-called blocking strategies [1], [10] and additional filter techniques tailored to specific similarity or distance functions (e.g. the triangle inequality for metric-space distance functions) [11]. Further performance improvements are achieved by performing ER in parallel on multiple processors and computing nodes, e.g. on Hadoop platforms. Proposed approaches are primarily based on the use of MapReduce, e.g. [3], [12], and [13]. Only some initial approaches consider the use of the Apache Spark framework for distributed ER [14], [15]. FAMER utilizes Apache Flink, which is similar to Apache Spark; both frameworks improve on MapReduce due to a better utilization of in-memory processing and better support for iterative algorithms as needed for clustering [16].
Most previous ER algorithms try to find matches either in a single source or between two sources only. For a single source, matching entities are typically grouped within disjoint clusters such that any two entities in a cluster should match with each other and no entity should match with entities of other clusters. For two sources, the match result is mostly a binary mapping consisting of pairs of matching entities (also called match correspondences or links). Binary match mappings may be postprocessed to determine clusters of matching entities, e.g. by calculating the transitive closure of the correspondences (connected components) in the simplest case. In FAMER, we extend this approach to more than two sources by first determining a similarity graph with binary match links between entities and then determining clusters of matching entities within the similarity graph. A similar use of similarity graphs has been considered in [17] and [18].
Hassanzadeh and colleagues [19] comparatively evaluated several clustering methods for single-source ER. We implemented parallel versions based on Flink of the best-performing approaches from this study and added them to FAMER, in particular, correlation clustering [20], Center [19], Merge Center [19], and two versions of Star [21] clustering in addition to connected components as a baseline approach. FAMER further includes two clustering algorithms specifically proposed for multi-source entity resolution, SplitMerge [5] and CLIP [7], that will also be evaluated in this article. Both approaches start with determining connected components, but post-process the resulting clusters (components) to obtain better clusters. In SplitMerge, clusters can be split if they contain entities with a low similarity to other cluster members; in a final merge phase some clusters, e.g. singletons from the split phase, can be merged with other similar clusters. CLIP considers different link types in a similarity graph and focuses on the use of so-called strong links for clustering. It is optimized for duplicate-free sources and ensures that each cluster has at most one entity per data source. CLIP can also be used to repair clusters determined by other clustering schemes [7], but this will not be studied in this article.
The comparative evaluation of different clustering schemes in this article allows a detailed analysis of their suitability for multi-source ER. For the first time, we here provide a comparison of the SplitMerge algorithm with other clustering schemes. In contrast to previous evaluations such as in [19], we consider parallel implementations of the algorithms and also evaluate runtimes and scalability for different data sizes.

FAMER Framework for Multi-Source Entity Resolution
Figure 1 illustrates the main components and processing steps of the FAMER framework for distributed multi-source entity resolution. The components are similar to the ones in previous entity resolution tools, but they support more than two sources and are implemented in Apache Flink to achieve a parallel execution for high scalability. The input of FAMER thus consists of multiple data sources with the entities to be matched and clustered. The output is a collection of clusters where all entities within a cluster match with each other and different clusters refer to different real-world objects. Since all entities of a cluster are assumed to match with each other, a cluster of m entities represents m · (m − 1)/2 match pairs.
In this article, we assume that the entities of the different sources are comparable, i.e. they belong to the same entity type (e.g. persons, products, cities, etc.) and have comparable attributes. We further assume that all sources are duplicate-free so that we only have to find matching entities between sources. This is based on the experience that data sources should first be preprocessed and cleaned before data integration; in particular, duplicates within data sources should first be removed or fused before matching with other sources [22]. The final match clusters should thus be source-consistent [7], i.e. they should contain at most one entity from each input data source. As a result, the maximal size of source-consistent clusters is limited by the number of sources.
FAMER consists of two main parts (Figure 1): generation of a similarity graph based on pairwise matches, and clustering. The first part has several steps (blocking, pairwise comparison, match classification), which can be customized according to a configuration input. We provide more details on the different steps below. We also illustrate the workflow of our framework for the person records in Table 1 that originate from four sources A, B, C and D and contain erroneous attribute values as is typical for real-world data. Entities with the same index are assumed to belong to the same cluster, e.g. entities a3 from source A and b3 from source B. Table 1 already groups the matching records referring to the same person.
In the first phase, we start with a blocking step to reduce the number of comparisons compared to a naïve approach where each entity of a data source has to be compared against all entities of any other source. FAMER supports different blocking techniques such as Standard Blocking (SB) and Sorted Neighborhood as well as single- and multi-pass blocking [1]. For SB, which we will use in our evaluation, entities are partitioned into blocks by a predefined blocking key (to be provided in the configuration input) on attribute values such that only entities with the same blocking key need to be compared with each other. For the 14 person records in Table 1, we assume that the two initial letters of the surnames form the blocking key. Table 2 shows the resulting blocking key values and blocks sharing the same key value. For this example, even though the entities are not evenly distributed among the blocks, blocking reduces the number of comparisons from 91 to only 66 + 1 = 67.
Such heavily skewed block sizes can result in significant runtime problems in a distributed implementation since the processing of large blocks can overload certain processing nodes while others with smaller blocks are underutilized. In order to achieve load balancing, FAMER supports the so-called Block Split method proposed in [12] where large blocks can be processed on several processing nodes. On the other hand, blocking may lead to missing some matches if similar entities are assigned to different blocks (e.g. entities c2 and d2 are not paired with entities a2 and b2). Such missing matches may persist even during clustering and can thus limit the achievable match quality. Multi-pass blocking can reduce this problem (at the expense of more comparisons) by partitioning the entities according to multiple blocking keys.
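Standard blocking with a surname-prefix key can be sketched as follows. This is a minimal single-machine illustration, not FAMER's Flink implementation; the records and the helper names (standard_blocking, candidate_pairs, surname_prefix) are made up for the example and do not reproduce Table 1.

```python
from collections import defaultdict
from itertools import combinations
from math import comb


def standard_blocking(records, key_fn):
    """Partition records into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return dict(blocks)


def candidate_pairs(blocks):
    """Return all record pairs that fall into the same block."""
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def surname_prefix(rec):
    """Blocking key: the two initial letters of the surname."""
    return rec[2][:2].lower()


# Hypothetical records: (id, source, surname)
records = [
    ("a1", "A", "Smith"), ("b1", "B", "Smyth"), ("c1", "C", "Smith"),
    ("a2", "A", "Meyer"), ("b2", "B", "Meier"), ("c2", "C", "Maier"),
]

blocks = standard_blocking(records, surname_prefix)
pairs = candidate_pairs(blocks)
naive = comb(len(records), 2)  # comparisons without blocking
print(f"{len(pairs)} comparisons with blocking instead of {naive}")
```

In this toy input the record c2 ("Maier") ends up alone in its block and is never compared with a2 and b2, which mirrors the missed-match problem described above.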
After blocking, all entities of a block from any of the input data sources are pairwise compared with each other. For each entity pair, we compute the similarity of their attribute values for the attributes and similarity functions specified in the configuration input. Currently, FAMER supports such attribute-based similarity computations for different string similarity metrics (e.g. q-gram, Jaro-Winkler, edit distance) and domain-specific similarity functions, e.g. using the distance between geographical entities. These similarity values are used in the following match classification step to decide whether or not a pair of entities is assumed to match. The classification approach is also specified in the configuration input, e.g. by match rules specifying the required minimal similarity for the considered attributes. A future version of FAMER will also support supervised match classification where training sets of matching and non-matching entity pairs are used to learn a classification model, e.g. decision trees or support vector machine (SVM) models [23].
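To make the pairwise comparison and rule-based classification concrete, the following sketch computes a q-gram similarity and applies a threshold rule per attribute. It assumes a Jaccard normalization over padded trigram sets, which is one common choice and not necessarily the one used in FAMER; all function names and the rule format are hypothetical.

```python
def qgrams(text: str, q: int = 3) -> set:
    """Set of q-grams of the lower-cased text, padded with '#' at both ends."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    """Jaccard overlap of the q-gram sets, a value in [0, 1]."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_match(rec_a: dict, rec_b: dict, rules: dict) -> bool:
    """Match rule: every listed attribute must reach its minimal similarity."""
    return all(
        qgram_similarity(rec_a[attr], rec_b[attr]) >= min_sim
        for attr, min_sim in rules.items()
    )


if __name__ == "__main__":
    r1 = {"surname": "Smith", "city": "Leipzig"}
    r2 = {"surname": "Smyth", "city": "Leipzig"}
    print(qgram_similarity(r1["surname"], r2["surname"]))
    print(is_match(r1, r2, {"surname": 0.3, "city": 0.9}))
```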
The output of match classification is the set of matching entity pairs (links) together with a combined similarity value per link. This output is stored as a similarity graph where entities are represented as vertices and match links as edges. Formally, a similarity graph SG = (V, E) is a graph in which the vertices of V represent entities and the edges of E are links between matching entities. There is no direct link between entities of the same source due to the assumption of duplicate-free sources. Edges have a property for the similarity value (a real number in the interval [0, 1]) indicating the degree of similarity.
The clustering step of FAMER aims at grouping together all matching vertices of the similarity graph based on the link structure of the graph and possibly the similarity values. Clustering algorithms typically try to group entities such that the similarity between entities within a cluster is maximized while the similarity between entities of different clusters is minimized. Compared to the similarity graph, the clustering algorithm can ideally add all missing matches (links) and remove all wrong links. As indicated in Figure 1, FAMER currently includes eight clustering algorithms that we describe and evaluate in the following sections. Two of them, SplitMerge and CLIP, require additional configuration parameters.
We have also developed a tool to visually analyze the similarity graphs and clusters determined by FAMER [24]. The tool supports the interactive exploration of large graphs and cluster sets, e.g. to analyze potential problems like unusually large, source-inconsistent or overlapping clusters.
Figure 2 illustrates the results of the described workflow for the sample entities of Table 1 and standard blocking as shown in Table 2. The entities are compared pairwise within the blocks and a rule-based match classification is applied, resulting in the similarity graph shown in the middle of Figure 2. Compared to the matches assumed in Table 1, the graph misses some links between matching entities, e.g. between a1 and c1. Employing ER clustering algorithms, the final clustering determines five clusters which are meant to represent different persons. In fact, the resulting clusters correspond to the ones shown in Table 1 so that even the entities a2, b2, c2, d2 from different blocks are correctly grouped together (which is possible for SplitMerge clustering).
FAMER is implemented using Apache Flink and a new extension for graph analytics called Gradoop [25], [26]. Hence, all match and clustering approaches can be executed in parallel on shared-nothing clusters of variable size. Gradoop supports an extended property graph model so that we store the attribute values of entities as key-value properties. Analogously, the similarity values of matching entity pairs are represented as edge properties. For the implementation of the parallel clustering schemes, we also use the Gelly library of Flink, which supports a so-called vertex-centric programming of graph algorithms (see next section).

Clustering Approaches
In this section, we present the eight clustering approaches for entity resolution and their parallel implementation. As described in the previous section, all algorithms use as input a similarity graph with entities from multiple data sources and similarity edges indicating the computed degree of similarity. The clusters determined by the algorithms group a set of entities from different sources that are assumed to represent the same real-world entity. In our implementation, especially for the CLIP algorithm, we also include the similarity links between cluster members from the originating similarity graph. Hence, a cluster C_i is represented by a cluster graph C_i = (V_i, E_i) with the clustered entities in V_i and intra-cluster similarity links in E_i. The SplitMerge algorithm also determines a so-called cluster representative for each cluster that is used to determine the similarity between clusters to decide whether clusters should be merged.
The parallel implementations are based on a vertex-centric programming model, also known as 'think like a vertex', to iteratively execute a user-defined program in parallel over all vertices of a graph. In particular, we use the two-step Scatter-Gather model of Gelly that breaks up vertex programming into two functions. In the Scatter step, a value is distributed to all vertex neighbors, and in the Gather step the inputs from the neighbors are collected to update the state of a vertex. The computation proceeds in synchronized iteration steps, called supersteps. Each scatter and each gather execution is performed in a different superstep. Supersteps are executed synchronously, so that messages sent during one superstep are guaranteed to be delivered at the beginning of the next superstep [27]. The vertex functions are executed by a configurable number of worker nodes among which the graph data is partitioned, e.g. according to a hash or range partitioning on the vertex ids. We will explain the vertex-centric implementation in detail for one of the clustering schemes (Center); the other implementations follow similar approaches.

Connected Components
The similarity graph contains one or more connected components, i.e. subgraphs in which any two vertices are connected to each other by a path and where there is no connection to other components. In the similarity graph of Figure 2, there are two connected components: a small one with the two entities c2 and d2 and a bigger one with all other entities. Given the input similarity graph, the connected components are easy to determine in a vertex-centric way by letting every vertex iteratively add all its direct neighbors to its cluster. The approach is therefore easy to implement with Scatter-Gather (as shown in [27]). In the evaluation, we use this approach as a baseline for the comparison with other clustering schemes. It is expected to find additional matches (and thus to improve recall) by grouping indirectly matching entities within clusters (components). On the other hand, it may lead to poor precision since indirect matches may not be similar enough to really represent the same real-world object.
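The following single-machine sketch mimics the synchronous scatter and gather rounds for connected components. It uses the common formulation in which every vertex propagates the smallest vertex id it has seen; this is an assumption for illustration and not the Gelly code used in FAMER.

```python
def connected_components(vertices, edges):
    """Min-label propagation in synchronous rounds (one scatter plus one gather)."""
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    label = {v: v for v in vertices}  # every vertex starts in its own cluster
    changed = True
    while changed:
        # Scatter: each vertex sends its current label to all neighbors.
        inbox = {v: [label[n] for n in neighbors[v]] for v in vertices}
        # Gather: each vertex keeps the smallest label received so far.
        changed = False
        for v in vertices:
            best = min(inbox[v] + [label[v]])
            if best != label[v]:
                label[v] = best
                changed = True

    clusters = {}
    for v, component_id in label.items():
        clusters.setdefault(component_id, set()).add(v)
    return list(clusters.values())


if __name__ == "__main__":
    vs = ["a1", "b1", "c1", "c2", "d2"]
    es = [("a1", "b1"), ("b1", "c1"), ("c2", "d2")]
    print(connected_components(vs, es))
```

The number of rounds grows with the diameter of the largest component, which is one reason why iterative dataflow support matters for this kind of algorithm.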

Center Clustering
In contrast to connected components, the Center clustering algorithm [28] utilizes the similarity values (weights) of the edges in the similarity graph. In the sequential algorithm, edges are first sorted based on these weights in descending order and put in a queue. Edges are then removed from the queue and processed one by one. For each edge e(v_i, v_j), if both v_i and v_j are unassigned to any cluster, one of them becomes a center and the other will belong to the cluster of that center. If one of them is a center and the other is unassigned, the unassigned vertex will belong to the cluster of the center vertex. If both vertices are centers or both of them are non-centers, or if one of them is a non-center and the other is unassigned, the edge is ignored.
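The sequential procedure just described can be sketched as follows. The choice of which endpoint becomes the center when both are unassigned is not fixed by the description, so the sketch simply takes the first endpoint (an assumption); vertices that are never assigned are returned as singletons.

```python
def center_clustering(vertices, weighted_edges):
    """Sequential Center clustering.

    weighted_edges: iterable of (u, v, weight) tuples.
    Returns a dict mapping each center to the set of its cluster members.
    """
    center_of = {}   # vertex -> center of the cluster it belongs to
    centers = set()

    # Process edges in descending order of their similarity weight.
    for u, v, _ in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        u_free = u not in center_of
        v_free = v not in center_of
        if u_free and v_free:
            # Assumption: the first endpoint becomes the center.
            centers.add(u)
            center_of[u] = u
            center_of[v] = u
        elif u in centers and v_free:
            center_of[v] = u
        elif v in centers and u_free:
            center_of[u] = v
        # All remaining combinations are ignored.

    clusters = {}
    for v in vertices:
        # Vertices that were never assigned form singleton clusters.
        clusters.setdefault(center_of.get(v, v), set()).add(v)
    return clusters


if __name__ == "__main__":
    edges = [("a", "b", 0.9), ("b", "c", 0.8), ("a", "d", 0.7)]
    print(center_clustering(["a", "b", "c", "d"], edges))
```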

Algorithm 1: Parallel Center
We propose and have implemented a parallel version of the Center algorithm (see Algorithm 1). In each round of the algorithm, the outgoing edge with the highest weight must be found for every unassigned vertex. The vertices on both sides of this edge are then processed. If one of them is a center, the other will belong to the cluster of that vertex (lines 6-8). If one of them is assigned to another cluster (line 9), i.e., both vertices belong to different clusters, the edge between these two vertices is removed (line 10). If both vertices are unassigned and the edge between them is the outgoing edge with the highest weight for both of them (line 13, i = k), then one of them becomes a center (line 14) and the other will belong to the same cluster in the next round. For selecting the center in this case, we make use of vertex priorities that are initially assigned according to a random permutation of the vertices (line 1), as done in the sequential algorithm. Hence, the vertex with the higher priority is considered as a center (line 16, i > nn). If a vertex is not connected to any other vertex (line 13, v_nn = Null), it is a singleton. The algorithm iterates until all vertices are assigned to a cluster (line 17).
We implemented parallel Center using the Scatter-Gather model (see Algorithm 2). The algorithm applies two phases that are iteratively executed for all vertices. Phase 1 (Scatter1, Gather1) finds for each vertex v_i its neighboring vertex with the currently highest edge weight, and phase 2 (Scatter2, Gather2) processes the status of the found vertex and assigns v_i to an existing cluster or considers it as a center. Again, we initially assign a priority per vertex (line 3). In phase 1, for each vertex v_i the neighbor with the K-th highest edge weight (nearest neighbor NN) is found (lines 13-21). K is a helper variable that prevents already assigned vertices from being chosen again as neighbors. It is attached to each vertex and initialized with 1 (lines 5-7). It is incremented in phase 2 when a vertex neighbor has been assigned to a cluster (lines 39-41). In phase 2, all neighbors of a vertex v_i are sorted and processed in descending order of the edge weights (for the edges to v_i) (lines 32-38). Then vertex v_i is set as a center similar to Algorithm 1 (lines 42-47).

Merge Center
The Merge Center clustering algorithm [28] is a modified version of Center. In contrast to Center, it merges two clusters if a vertex in one cluster is similar to the center of another cluster. Our parallel implementation of Merge Center is very similar to parallel Center but applies an extra iteration for merging clusters. This iteration is initiated right after all vertices are assigned to a cluster. The merge processing is repeated until there are no further cluster changes.

Star Clustering
The Star clustering algorithm [21] initially computes the degree of each vertex of the similarity graph. Then, in each iteration, the unassigned vertex with the highest degree becomes a center and all its direct neighbors are assigned to its cluster. The algorithm terminates when all vertices are assigned to a cluster. In contrast to all other clustering approaches, Star clustering can result in overlapping clusters. As a consequence, it requires a post-processing step to select the best cluster for entities that have been assigned to several clusters.
Our parallel version of the Star algorithm is described in Algorithm 3. Initially, the degree of all vertices is computed and, if the degree of a vertex is greater than the degree of all its neighbors, that vertex becomes a center (lines 4-7). If the degrees of two adjacent vertices are equal, the one with the higher priority becomes a center. Similar to the previous parallel algorithms, vertex priority is initially determined by generating a random permutation of vertices (line 1). Then each center and all its neighbors are considered as a cluster (lines 8-12). The Scatter-Gather version of Algorithm 3 uses three phases. In the first phase, the degree of each vertex is computed. In the second phase, centers are selected, and in the final phase, clusters are grown around the centers.
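A compact sketch of the sequential Star procedure is given below. It assumes that ties between equal degrees are broken by vertex id to keep the result deterministic; this tie-breaking rule and the function name are choices made for the example only.

```python
def star_clustering(vertices, edges):
    """Sequential Star clustering; the resulting clusters may overlap.

    Returns a dict mapping each center to the set of its cluster members.
    """
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = {v: len(neighbors[v]) for v in vertices}

    assigned = set()
    clusters = {}
    # Degrees are static, so visiting vertices by descending degree and skipping
    # assigned ones equals repeatedly picking the unassigned maximum.
    for v in sorted(vertices, key=lambda x: (-degree[x], x)):
        if v in assigned:
            continue
        clusters[v] = {v} | neighbors[v]  # neighbors may already be in another cluster
        assigned |= clusters[v]
    return clusters


if __name__ == "__main__":
    path = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
    print(star_clustering(["a", "b", "c", "d", "e"], path))
```

On the path a-b-c-d-e, vertex c ends up in the clusters of both b and d, which illustrates why a post-processing step is needed to resolve overlaps.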

Algorithm 3: Parallel Star
We use two methods for computing the degree of vertices, resulting in the variants Star-1 and Star-2. For Star-1, we count the number of outgoing edges of a vertex, while Star-2 is based on the average similarity of the outgoing edges of a vertex.
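The two degree definitions can be contrasted with a small sketch. The adjacency representation (a dict from vertex to a dict of neighbor similarities) and the function names are assumptions for illustration.

```python
def degree_star1(vertex, weighted_neighbors):
    """Star-1: number of outgoing edges of the vertex."""
    return len(weighted_neighbors[vertex])


def degree_star2(vertex, weighted_neighbors):
    """Star-2: average similarity of the outgoing edges (0.0 for isolated vertices)."""
    weights = list(weighted_neighbors[vertex].values())
    return sum(weights) / len(weights) if weights else 0.0


if __name__ == "__main__":
    # Hypothetical similarity graph: vertex -> {neighbor: similarity}
    graph = {
        "a": {"b": 0.9, "c": 0.7},
        "b": {"a": 0.9},
        "c": {"a": 0.7},
    }
    for v in graph:
        print(v, degree_star1(v, graph), degree_star2(v, graph))
```

In this toy graph, a has the highest Star-1 degree, whereas b has the highest Star-2 degree, so the two variants can select different centers.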

CCPivot Correlation Clustering
The original correlation clustering approach [29] uses a graph with positive and negative edge weights to indicate whether two vertices are similar (positive edge weight) or dissimilar (negative edge weight). The goal is to find a clustering that either maximizes agreements (sum of positive edge weights within a cluster plus the absolute value of the sum of negative edge weights between clusters) or minimizes disagreements (absolute value of the sum of negative edge weights within a cluster plus the sum of positive edge weights across clusters). Gionis et al. propose an approximate and iterative solution for this optimization problem [30] that randomly selects an unassigned vertex as a cluster center in each round. Then all unassigned neighbors of the selected center are added to the cluster and marked as assigned. The algorithm terminates when there are no unassigned vertices left.
This simple algorithm suffers from too many rounds, making it unsuitable for very large graphs. Some studies therefore proposed parallel solutions [20], [31] that select multiple centers in each round. They also address the newly introduced concurrency problem to avoid that a vertex is assigned to more than one center at a time. We implemented the parallel pivot approach of [20], called CCPivot, since it fits the Scatter-Gather paradigm well. In each round of this algorithm, several vertices are considered as active nodes, i.e. as candidates for becoming a cluster center (or pivot). In the next step, active nodes that are adjacent to each other are removed from the set of active nodes; the remaining vertices become centers. Then the adjacent vertices of each center are assigned to that center and form a cluster. If a vertex is adjacent to more than one center at the same time, it will belong to the one with the higher priority. As in the other algorithms, the vertex priorities are determined in a preprocessing phase.
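The sequential pivot procedure described above can be sketched in a few lines. Visiting the vertices in a random permutation and skipping those already assigned is equivalent to repeatedly drawing a random unassigned vertex; the seed parameter is only there to make the illustration reproducible.

```python
import random


def pivot_clustering(vertices, edges, seed=0):
    """Sequential pivot-based correlation clustering on an unweighted graph.

    Returns a dict mapping each pivot (center) to the set of its cluster members.
    """
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    order = list(vertices)
    random.Random(seed).shuffle(order)  # random pivot order

    assigned = set()
    clusters = {}
    for pivot in order:
        if pivot in assigned:
            continue
        # The pivot takes all neighbors that are still unassigned.
        members = {pivot} | (neighbors[pivot] - assigned)
        clusters[pivot] = members
        assigned |= members
    return clusters


if __name__ == "__main__":
    vs = ["a", "b", "c", "d", "e"]
    es = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
    print(pivot_clustering(vs, es, seed=42))
```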
Our Scatter-Gather implementation of this algorithm uses three Scatter-Gather phases: one for computing the current maximum degree of the graph, one for selecting active nodes and applying the concurrency-aware rule to select final centers, and one for growing clusters around centers.

SplitMerge Clustering
The SplitMerge approach proposed in [5] is more general than the other clustering schemes as it can deal with entities of different semantic types as well as dirty input sources and links, e.g. with duplicates in sources. Furthermore, SplitMerge can compute additional links between entities based on a similarity function provided within a configuration parameter. Further parameters are similarity thresholds for the split and merge phases and a blocking function for the merge phase.
Algorithm 4 shows the pseudo-code of the SplitMerge approach, which consists of three main phases: (1) determining initial clusters by applying connected components and making the components source-consistent, (2) splitting clusters to ensure a high intra-cluster similarity, and (3) merging similar clusters. In contrast to [5], we have omitted the preprocessing phase since we only consider entities of a single type and duplicate-free sources in this study. The application of SplitMerge to the similarity graph from Figure 2 is illustrated in Figure 3. More details on the Flink implementation of SplitMerge are described in [6].
SplitMerge starts with computing connected components (line 2 of Algorithm 4) on the input similarity graph to create initial components C_init. The resulting components may often violate the required source consistency since entities from the same source may be indirectly linked and thus become members of the same connected component. In our example in Figure 3, there are only two connected components, where the smaller one (with entities c2 and d2) is source-consistent but the larger one contains up to four entities per source. To achieve source-consistent clusters, we decompose the inconsistent components by removing links that result in a violation of source consistency. The links (a0, b0) and (b0, a1) result in a source inconsistency for source A, and we solve this by removing one of the two links (the one with lower similarity). Another example with three links resulting in a source inconsistency is the chain (b1, c1, a2, b2); again, we eliminate at least one link, e.g. (c1, a2), to solve the problem.

To identify the links to be removed, we record for every entity e the set of already associated data sources in an element assocSrc(e), which initially contains the source of e (line 4). We iterate over all links of a component in descending order of their similarity. For each considered link (e_s, e_t), we check whether it results in a source inconsistency, which is the case if there is a non-empty overlap between assocSrc(e_s) and assocSrc(e_t). If there is such a conflict, the link will be eliminated (line 8). Otherwise, we update both sets of associated sources to the union of assocSrc(e_s) and assocSrc(e_t) (line 10). In the example of Figure 3, the conflicting links that are removed are shown in red. For instance, if we first process link (a0, b0), we will have sources A and B in assocSrc(a0) and assocSrc(b0). The link (b0, a1) will then lead to a conflict for b0, which is already associated with source A, so that this link is eliminated. After the processing of all links, we determine the connected components with the remaining links to compute the source-consistent subcomponents (line 11). In our running example, we obtain the four smaller clusters shown (with green borders) in the third graph from the left in Figure 3.
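The link filter can be sketched as shown below. Note that this is a variant and not Algorithm 4 itself: instead of one assocSrc set per entity, the sketch keeps one source set per growing group via a union-find structure, which is an assumption made here so that the resulting groups are source-consistent by construction. All names are hypothetical.

```python
def make_source_consistent(entities, links):
    """Drop links that would put two entities of the same source into one group.

    entities: dict mapping entity id -> source id.
    links: list of (e_s, e_t, similarity) tuples.
    Returns (kept_links, clusters).
    """
    parent = {e: e for e in entities}
    assoc_src = {e: {src} for e, src in entities.items()}  # valid for group roots

    def find(e):
        while parent[e] != e:
            parent[e] = parent[parent[e]]  # path halving
            e = parent[e]
        return e

    kept = []
    # Consider links in descending order of similarity.
    for e_s, e_t, sim in sorted(links, key=lambda link: link[2], reverse=True):
        root_s, root_t = find(e_s), find(e_t)
        if root_s == root_t:
            kept.append((e_s, e_t, sim))  # both ends already in one consistent group
            continue
        if assoc_src[root_s] & assoc_src[root_t]:
            continue  # conflict: the link would violate source consistency
        parent[root_t] = root_s
        assoc_src[root_s] |= assoc_src[root_t]
        kept.append((e_s, e_t, sim))

    clusters = {}
    for e in entities:
        clusters.setdefault(find(e), set()).add(e)
    return kept, list(clusters.values())


if __name__ == "__main__":
    ents = {"a0": "A", "b0": "B", "a1": "A"}
    lks = [("b0", "a1", 0.8), ("a0", "b0", 0.9)]  # similarities are made up
    print(make_source_consistent(ents, lks))
```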
For the split phase, we process the clusters from the first phase in parallel. For each cluster, we first determine link similarities for each pair of entities based on the similarity function f_sim provided in the input. This is needed to identify entities with an insufficient similarity to other cluster members. To determine possible splits (line 15), we determine for each entity the average similarity of its links to other cluster members and separate an entity if the average similarity is below the split threshold t_s. After the elimination of such entities, we iteratively repeat this split processing based on recomputed entity similarities until all entity similarities are at least as high as threshold t_s. In our example, this processing leads to the elimination of d4 from cluster {a2, b2, d4} (fourth graph from the left in Figure 3). For each resulting cluster, we next determine a cluster representative (line 16) from the properties of the cluster members, e.g. based on the values of preferred sources or a majority consensus of values. As indicated in Figure 3, each cluster representative has a unique id and keeps track of the covered cluster entities and their sources as provenance information. The representatives are used for a simplified computation of cluster similarities as needed for the final merge phase.
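The iterative split of a single cluster can be sketched as follows. The description leaves open whether all weak entities are separated at once; the sketch removes only the weakest entity per round before recomputing the averages, which is an assumption. The similarity function stands in for f_sim, and the values in the usage example are invented.

```python
def split_cluster(members, sim, t_s):
    """Separate entities whose average similarity to the rest is below t_s.

    members: iterable of entity ids; sim(a, b) returns a similarity in [0, 1].
    Returns (remaining_members, separated_entities).
    """
    remaining = set(members)
    separated = []
    while len(remaining) > 1:
        avg = {
            e: sum(sim(e, o) for o in remaining if o != e) / (len(remaining) - 1)
            for e in remaining
        }
        weak = [e for e in remaining if avg[e] < t_s]
        if not weak:
            break
        # Assumption: split off only the weakest entity, then recompute.
        worst = min(weak, key=lambda e: (avg[e], e))
        remaining.remove(worst)
        separated.append(worst)
    return remaining, separated


if __name__ == "__main__":
    # Hypothetical pairwise similarities for the cluster {a2, b2, d4}
    table = {
        frozenset({"a2", "b2"}): 0.9,
        frozenset({"a2", "d4"}): 0.3,
        frozenset({"b2", "d4"}): 0.2,
    }
    f_sim = lambda a, b: table[frozenset({a, b})]
    print(split_cluster({"a2", "b2", "d4"}, f_sim, t_s=0.5))
```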
The goal of the merge phase is to identify highly similar pairs of clusters that likely represent the same real-world entity and should thus be combined. This can also help to assign entities separated during the split phase to a more similar cluster. The first step is to determine a so-called cluster mapping CM (line 18 of Algorithm 4) consisting of all cluster pairs with a similarity above the merge threshold t_m (merge candidates). The similarity between clusters is computed by applying function f_sim on the cluster representatives. Since the computation of these similarities is expensive for many clusters, we reduce the number of comparisons by applying a blocking function bf specified as an input parameter (in the current implementation we apply standard blocking on selected properties of the cluster representatives). Furthermore, we only compare clusters with entities from different sources since otherwise merging these clusters would violate source consistency. In our example in Figure 3, we have three clusters in the first block and only one in each of the remaining three blocks. For the first block, we obtain two merge candidates with a sufficiently high cluster similarity.
Cluster merging is an iterative process (lines 19 to 23) that continues as long as there are merge candidates in the determined cluster mapping CM. In each iteration, we select the pair of clusters (c_1, c_2) with the highest similarity from CM (line 20) and merge it into a new cluster c_m (line 21). This merging also includes the computation of a new representative for c_m. The "old" clusters c_1 and c_2 are removed from the cluster set and the new cluster c_m is added. We further need to adapt CM by removing all cluster pairs involving either c_1 or c_2 (line 22). Furthermore, we have to extend CM by similar cluster pairs (c_i, c_m) for the new cluster c_m with a cluster similarity of at least t_m and entities from different sources (line 23). For our running example, we first process the merge candidate with similarity 0.9 and obtain the merged cluster {a_2, b_2, c_2, d_2}. The second merge candidate is removed and it is checked whether the new cluster results in new merge candidates. Since the new cluster already contains entities from every source, merging any other cluster would result in a source inconsistency, so that no new merge candidates result in the example. The final outcome of SplitMerge contains five clusters, which correspond to the perfect result in Figure 2.
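The merge loop can be sketched as follows. This is a simplified, sequential illustration under several assumptions: blocking is omitted, cluster similarity is passed in as a function (here a hypothetical average-link stand-in instead of f_sim on cluster representatives), and the cluster contents and similarity values are invented for the example.

```python
from itertools import combinations


def merge_clusters(clusters, source_of, cluster_sim, t_m):
    """Greedily merge the most similar source-disjoint cluster pairs."""
    clusters = {frozenset(c) for c in clusters}

    def sources(c):
        return {source_of[e] for e in c}

    def candidate_sim(c1, c2):
        # Clusters sharing a source must not be merged (source consistency).
        if sources(c1) & sources(c2):
            return None
        sim = cluster_sim(c1, c2)
        return sim if sim >= t_m else None

    # Cluster mapping CM: merge candidates and their similarity.
    cm = {}
    for c1, c2 in combinations(clusters, 2):
        sim = candidate_sim(c1, c2)
        if sim is not None:
            cm[frozenset((c1, c2))] = sim

    while cm:
        c1, c2 = max(cm, key=cm.get)  # most similar candidate pair
        c_m = c1 | c2
        clusters -= {c1, c2}
        # Remove all candidates that involve one of the merged clusters.
        cm = {p: s for p, s in cm.items() if c1 not in p and c2 not in p}
        # Add new candidates for the merged cluster.
        for c_i in clusters:
            sim = candidate_sim(c_i, c_m)
            if sim is not None:
                cm[frozenset((c_i, c_m))] = sim
        clusters.add(c_m)
    return clusters


source_of = {"a2": "A", "b2": "B", "c2": "C", "d2": "D", "d4": "D"}
entity_sim = {
    frozenset(pair): sim
    for pair, sim in [
        (("a2", "c2"), 0.9), (("a2", "d2"), 0.9),
        (("b2", "c2"), 0.9), (("b2", "d2"), 0.9),
        (("a2", "d4"), 0.85), (("b2", "d4"), 0.85),
    ]
}


def average_link(c1, c2):
    # Hypothetical stand-in for comparing cluster representatives.
    sims = [entity_sim.get(frozenset((x, y)), 0.0) for x in c1 for y in c2]
    return sum(sims) / len(sims)


result = merge_clusters(
    [{"a2", "b2"}, {"c2", "d2"}, {"d4"}], source_of, average_link, t_m=0.8
)
```

Here the candidate with the higher similarity is merged first, the second candidate is discarded, and no new candidate arises because the merged cluster already covers all four sources.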

CLIP
The CLIP algorithm (Clustering based on LInk Priority) [7] is able to produce source-consistent clusters. It utilizes different link characteristics such as the link strength and link degree that we introduce first before outlining the approach. In a similarity graph, an entity from a source A may have several links to entities of a source B. From these links, the one with the highest similarity value is called maximum link. For instance, for entity a_1 in Figure 4 the maximum link with respect to source B is the one with similarity 0.95 to entity b_1. Based on this concept we define the strength of links and classify them into strong, normal, and weak links. Considering a link ℓ between entity e_i from source A and entity e_j from another source B, we define these link types as follows:
• Link ℓ is classified as a strong link if it is the maximum link from both sides, i.e. for e_i to source B and for e_j to source A. In Figure 4, entity a_1 from source A has a strong link, colored in green, to b_1 in source B. Note that an entity can have several strong links to different sources, e.g. a_1 is also strongly linked to c_2 from source C.
• Link ℓ is called a normal link if it is the maximum link for only one of the two sides. In Figure 4, the link between a_1 and b_2 is a normal link (colored in blue) as it is the maximum link from b_2 to source A, but not the maximum link from a_1 to source B.
• Link ℓ is a weak link if it is not the maximum link for any of the two sides. In Figure 4, the link between a_1 and b_0 is such a weak link and is shown with a red dashed line.
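The three link types can be computed directly from the definition, as the following sketch shows. The function name classify_links is hypothetical, ties are treated as maximum links for every tied link (an assumption the text does not address), and only the similarity 0.95 between a_1 and b_1 is taken from the example; the other values are invented.

```python
def classify_links(links, source_of):
    """Label each link as strong, normal, or weak.

    links: list of (e_i, e_j, similarity) between entities of different sources.
    """
    # Highest similarity from each entity to each other source.
    best = {}
    for e_i, e_j, sim in links:
        for entity, other in ((e_i, e_j), (e_j, e_i)):
            key = (entity, source_of[other])
            best[key] = max(best.get(key, 0.0), sim)

    labels = {2: "strong", 1: "normal", 0: "weak"}
    strength = {}
    for e_i, e_j, sim in links:
        is_max_for_i = sim == best[(e_i, source_of[e_j])]
        is_max_for_j = sim == best[(e_j, source_of[e_i])]
        strength[(e_i, e_j)] = labels[is_max_for_i + is_max_for_j]
    return strength


source_of = {"a0": "A", "a1": "A", "b0": "B", "b1": "B", "b2": "B"}
links = [
    ("a1", "b1", 0.95),  # maximum link for both a1 and b1
    ("a1", "b2", 0.80),  # maximum for b2 only
    ("a1", "b0", 0.70),  # maximum for neither side
    ("a0", "b0", 0.90),
]
strength = classify_links(links, source_of)
```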
Furthermore, we define the link degree of a link as the minimum degree of the two linked vertices. In Figure 4, the vertex degree of a_1 is 4 and the vertex degree of b_1 is 3, so that the link degree between a_1 and b_1 is min(4, 3) = 3. Finally, we call a source-consistent cluster that contains entities from all sources a complete cluster. In Figure 2, the cluster containing the entities with index 0 is a complete cluster since it is source-consistent (at most one entity per source) and contains at least one entity for each of the four sources. The definition implies that complete clusters contain exactly one entity from each input data source.
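Both definitions are simple enough to state as code. In the sketch below the neighbors of a_1 and b_1 are invented so that their vertex degrees are 4 and 3 as in the example; the function names are hypothetical.

```python
from collections import Counter


def link_degrees(links):
    """Link degree = minimum vertex degree of the two linked entities."""
    degree = Counter()
    for u, v in links:
        degree[u] += 1
        degree[v] += 1
    return {(u, v): min(degree[u], degree[v]) for u, v in links}


def is_complete(cluster, source_of, all_sources):
    """True if the cluster has exactly one entity from every source."""
    srcs = [source_of[e] for e in cluster]
    source_consistent = len(srcs) == len(set(srcs))
    return source_consistent and set(srcs) == set(all_sources)


# a1 has four links, b1 has three (neighbors are made up for illustration).
links = [
    ("a1", "b0"), ("a1", "b1"), ("a1", "b2"), ("a1", "c2"),
    ("b1", "c1"), ("b1", "d1"),
]
degrees = link_degrees(links)

source_of = {"a0": "A", "b0": "B", "c0": "C", "d0": "D", "a1": "A"}
complete = is_complete({"a0", "b0", "c0", "d0"}, source_of, "ABCD")
incomplete = is_complete({"a0", "b0", "c0"}, source_of, "ABCD")
inconsistent = is_complete({"a0", "a1", "b0", "c0", "d0"}, source_of, "ABCD")
```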
The CLIP algorithm favors strong links for finding clusters while weak links are ignored. This aims at finding good clusters even when the similarity graph contains many links with lower similarity values. The approach works in two main phases. In the first phase, CLIP determines all complete clusters based on strong links between entities from all sources. The second phase also considers normal links and iteratively clusters the remaining entities based on link priorities such that no source-inconsistent clusters are generated.
The pseudocode of CLIP is shown in Algorithm 5. Its input is a similarity graph SG and a configuration parameter specifying how to determine link priorities; the output is the cluster set CS. Figure 5 illustrates the algorithm for the entities and similarity graph from our running example from Figure 2. In phase 1, we start with determining the strength of all links (line 2 of Algorithm 5). Then we apply computeConnectedComponents on the graph with vertices V and only strong links E_Strong to identify complete clusters and add these to the output (lines 3-4). In the example of Figure 5, the second graph in the upper half differentiates between strong, normal, and weak links by showing them as green, blue and dashed red lines, respectively. Focusing on strong links, we obtain four connected components in the example, one of which (for index 0) results in a complete cluster that is added to the output of phase 1.
For phase 2, we remove the vertices and edges of the complete clusters. Furthermore, we ignore weak links and only consider strong and normal links (line 5 of Algorithm 5). Again we use computeConnectedComponents to consider the resulting connected components as possible clusters (line 6). Afterwards these components C_i are processed in parallel (line 7). If the cluster C_i is already a source-consistent cluster, it is directly added to the CLIP output (lines 8-9). Otherwise the component/cluster is source-inconsistent and is processed as outlined below. In the example of Figure 5, phase 2 is illustrated in the lower part, which starts with a reduced similarity graph that no longer has the entities from the complete cluster determined in phase 1 and that only contains strong and normal links. We then obtain three connected components, two of which (with index 2 and index 3) are already source-consistent clusters that are thus added to the output. The remaining source-inconsistent component/cluster needs further processing.
In the processing of source-inconsistent clusters/components, we sequentially process the intra-component links (lines 12-15) in descending order of their link priority (determined by sortLinksByPriority in line 11), which is based on the link similarity value, link strength and link degree. The parameter config in line 11 determines the weight of these three factors to compute the link priority. Assuming that the individual entities are singleton clusters in the beginning, we iteratively process the links to determine whether the clusters of the linked entities can be merged without introducing source inconsistency. In line 13, we thus check for each link (e_s, e_t) whether their clusters C_s and C_t are compatible, i.e. whether together they do not include more than one entity per source. Only if this is the case, we merge the two clusters and update the cluster set accordingly (lines 14-15). The union of all cluster sets CS_i determined in this way for the different components, combined with the clusters previously determined in phase 1, forms the final output of CLIP (line 16). In the example of Figure 5, we start with the link between a_2 and b_2 in the third graph for phase 2 and merge these entities into a new cluster. Then the link between b_1 and c_1 is selected and these entities are merged into one cluster as well. Then the link from c_1 to a_2 is taken. The clusters at the ends of this link are not compatible because both have one entity from source B. Processing all links in sorted order in the example leads to adding the entity a_1 to the cluster containing entities of index 1. Similarly, the entity d_4 is added to the cluster containing entities a_2 and b_2. The output of phase 2, together with the output of phase 1, results in five clusters. Compared to the perfect result shown in Figure 2, only three clusters (with indices 0, 1, 3) are correct while the entities with index 2 are not grouped together because they were not linked in the similarity graph due to the lossy blocking approach applied.
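The priority-driven merging of a source-inconsistent component can be sketched as below. The sketch uses the link similarity alone as priority, which is only one possible configuration; the weighted combination with link strength and link degree is not reproduced here. The function name resolve_component and all similarity values are assumptions for illustration.

```python
def resolve_component(entities, links, source_of, priority):
    """Merge singleton clusters along links in descending priority order,
    skipping merges that would put two entities of one source together."""
    cluster_of = {e: frozenset([e]) for e in entities}

    def sources(cluster):
        return {source_of[e] for e in cluster}

    for link in sorted(links, key=priority, reverse=True):
        e_s, e_t = link[0], link[1]
        c_s, c_t = cluster_of[e_s], cluster_of[e_t]
        if c_s == c_t:
            continue  # entities are already in the same cluster
        if sources(c_s) & sources(c_t):
            continue  # not compatible: a source would occur twice
        merged = c_s | c_t
        for e in merged:
            cluster_of[e] = merged
    return set(cluster_of.values())


source_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C", "d4": "D"}
links = [
    ("a2", "b2", 0.95),
    ("b1", "c1", 0.90),
    ("c1", "a2", 0.85),  # rejected: both clusters contain a source-B entity
    ("a1", "b1", 0.80),
    ("d4", "a2", 0.70),
]
clusters = resolve_component(
    list(source_of), links, source_of, priority=lambda link: link[2]
)
```

The processing order mirrors the example: a_2 and b_2 are merged first, then b_1 and c_1, the link from c_1 to a_2 is rejected, and a_1 and d_4 are attached to their respective clusters afterwards.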
The clustering in the second phase is an iterative process mainly based on link priority. The initial CLIP implementation of [7] updated link priorities in each iteration, thereby causing high runtimes. The optimized CLIP version described in this section computes the link priorities only once and thus uses static priorities in the second phase. We found that the new approach achieves about the same result quality but leads to much lower runtimes, which are presented in Section 5.

Evaluation
The goal of the evaluation is to compare the effectiveness and efficiency of the considered clustering approaches and their distributed implementations for different datasets and configurations. We first describe the used datasets from three domains and the considered configurations. We then analyze the relative match and clustering effectiveness of the clustering schemes. Finally, we evaluate the runtime performance and scalability of the approaches.
functions or geographical distance) to compute attribute similarities and require the similarities to reach or exceed a minimal fixed or variable similarity threshold θ.

Match Quality of Clustering Approaches
To evaluate the ER quality of our clustering results, we use the standard metrics precision, recall and their harmonic mean, F-Measure. These metrics are determined by comparing the computed match pairs (derived from the computed clusters assuming that all entities in a cluster match) with the perfect match results.
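The pair-based computation of these metrics can be sketched as follows; the function names and the small example clusters are hypothetical and not taken from the evaluation datasets.

```python
from itertools import combinations


def match_pairs(clusters):
    """All entity pairs implied by the clusters (every cluster member
    is assumed to match every other member of the same cluster)."""
    return {
        frozenset(pair)
        for cluster in clusters
        for pair in combinations(sorted(cluster), 2)
    }


def pair_quality(computed, truth):
    """Precision, recall and F-Measure of computed vs. perfect clusters."""
    found, gold = match_pairs(computed), match_pairs(truth)
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


computed = [{"a0", "b0", "c0"}, {"a1", "b1"}, {"c1"}]
truth = [{"a0", "b0", "c0"}, {"a1", "b1", "c1"}]
precision, recall, f_measure = pair_quality(computed, truth)
```

In this example all four computed pairs are correct (precision 1.0), but only four of the six true pairs are found (recall of about 0.67), giving an F-Measure of 0.8.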
In Figure 6, we compare the obtained precision, recall and F-Measure results for the eight clustering schemes, different similarity thresholds θ and our three datasets using the default configurations from Table 4 to determine the initial similarity graphs. We also include the results obtained already with the similarity graphs used as input to the clustering schemes, although these graphs only contain links, but no clusters. Furthermore, we show the results for a SplitMerge variation called Split that leaves out the merge phase for faster processing. We observe that for DS1 and DS3 most clustering schemes achieve a relatively high F-Measure of more than 0.9 (DS1) and 0.8 (DS3) for the considered θ range between 0.75 and 0.9. By contrast, for the noisy data records of DS2 we had to lower the similarity thresholds to values between 0.35 and 0.45 and still could mostly not exceed the quality of the input similarity graph (with a maximal F-Measure of about 0.75), underlining that DS2 represents a more difficult match problem than DS1 or DS3. For SplitMerge, we also experimented with different values for the split and merge thresholds and found that the split threshold should be chosen lower than the similarity threshold θ so that clusters are only split when there are links with a low similarity. By contrast, the merge threshold should be higher than θ so that only very similar clusters are merged. The shown results refer to a fixed setting per dataset, e.g. a split threshold of 0.4 and a merge threshold of 0.8 for DS1.

Comparing the clustering schemes, we observe that there are substantial differences in their relative match quality. Connected Components reaches the lowest F-Measure for all datasets and almost all threshold values because it suffers from very poor precision values. Merge Center shows a similar behavior in terms of poor precision and F-Measure, indicating that the merging of clusters can often lead to wrong cluster decisions. Of the other previously known ER clustering schemes (CCPivot, Center, Star-1, and Star-2), Star-1 has the lowest F-Measure, especially for lower similarity thresholds. The other approaches, Center, Star-2 and CCPivot, are superior although they can exceed the F-Measure of the input graph in only few cases (Star-2 for DS1, Center for DS3). The better quality of Center comes from its initial focus on edges with high weights, thereby ignoring edges with lower similarity. Star-2 is better than Star-1 since its degree-based selection of cluster centers is based on a high degree of similarity to neighbors rather than only the number of neighbors. CCPivot improves precision over the input similarity graph but suffers from lower recall so that F-Measure is not improved over the similarity graph.
By contrast, the two newly introduced algorithms, CLIP and SplitMerge (as well as Split), achieve excellent match quality and outperform all previous algorithms (and the input similarity graph) in terms of precision and F-Measure for all three datasets. CLIP generally reaches the best precision because it ignores weak links, making it effective even for low similarity thresholds as necessary for low data quality. The recall of CLIP, SplitMerge and Split is also among the best values achieved, especially for SplitMerge, which is based on connected components and where the final merge phase helps to find additional links. A closer inspection of the CLIP behavior showed that its good recall is already achieved by determining the connected components for finding complete clusters and source-consistent clusters involving only strong and normal links. Comparing Split and SplitMerge, SplitMerge always achieves a slightly better F-Measure because its merge phase yields a better recall than Split, which more than outweighs a somewhat reduced precision. For DS2, Split and SplitMerge are significantly better than CLIP and the other approaches due to a high precision and recall, respectively, while CLIP outperforms SplitMerge for DS3 due to a better precision.
These observations are confirmed by Figure 7 showing the average F-Measure results of the clustering schemes over all threshold configurations. The vertical lines show the F-Measure spread between the minimal and maximal value for the different threshold values used to determine the input similarity graphs. We again observe the low and highly variable match quality of Connected Components and Merge Center. By contrast, the remaining algorithms including the top-performing SplitMerge and CLIP algorithms are more robust and achieve much better F-Measure values. Interestingly, the Split approach alone achieves almost the same high F-Measure as SplitMerge. SplitMerge always achieves the same or better recall and F-Measure than Split, but the additional gains in F-Measure are small (at most 2% for DS2). CLIP is similarly effective as Split and SplitMerge, but it is easier to configure since it does not require the specification of additional similarity thresholds for splitting and merging.
SplitMerge since it requires the similarity computation for a large number of cluster pairs and an expensive iterative merge processing. CLIP with the new implementation is even faster than Split and thus among the fastest algorithms. The old, iterative version of CLIP needed about 5000 s with 16 workers for DS3 with 10 parties [7] so that the new implementation improves runtimes by about a factor of 20 for this dataset.
Except for Connected Components, all algorithms can reduce their runtimes by applying more workers, especially for the larger dataset with 10 parties. Figure 8 shows the resulting speedup values. For DS3 with 5 parties, most algorithms except the iterative CCPivot and SplitMerge approaches achieve an almost linear speedup. By contrast, the high-quality approaches Split and CLIP scale well for this dataset.
For the bigger dataset with 10 parties, speedup values are mostly even better and partly super-linear. The latter, however, is an artifact for the slower algorithms like Merge Center that perform poorly for 4 workers because of memory bottlenecks (its runtime for 4 workers is almost 6 times higher for 10 parties than for 5 parties). The substantially increased aggregate memory capacity for 8 and 16 workers thus enabled super-linear runtime improvements but without reaching the absolute runtimes of fast algorithms like Star-2. Again, SplitMerge scales poorly due to the overly expensive merge phase while Split and CLIP achieve both low absolute runtimes and good speedup.
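To make the notion of super-linear speedup concrete, the following sketch computes speedup relative to a baseline configuration and compares it with the ideal linear value. It assumes that speedup is measured relative to the 4-worker run, which the text suggests but does not state explicitly, and the runtimes are purely hypothetical and not measurements from the evaluation.

```python
def speedup(runtime_base, runtime_n):
    """Observed speedup of a run relative to the baseline run."""
    return runtime_base / runtime_n


def ideal_speedup(workers_base, workers_n):
    """Linear speedup expected from the increase in workers."""
    return workers_n / workers_base


# Hypothetical runtimes in seconds, keyed by number of workers.
runtimes = {4: 1200.0, 8: 500.0, 16: 240.0}
base = 4  # assumed baseline configuration

report = {
    n: (speedup(runtimes[base], t), ideal_speedup(base, n))
    for n, t in runtimes.items()
}
# Configurations whose observed speedup exceeds the linear expectation.
super_linear = [n for n, (observed, ideal) in report.items() if observed > ideal]
```

With these numbers, doubling the workers from 4 to 8 yields a speedup of 2.4 instead of 2, which is the kind of effect that a memory bottleneck in the baseline run can produce.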
The high runtimes for SplitMerge (and CCPivot) are heavily influenced by the underlying Flink and Gelly systems and their approaches for iterative processing, which lead to high memory and communication overhead. We are investigating possible performance optimizations to make the approaches more scalable.

Conclusions and Outlook
We presented a new scalable entity resolution (ER) framework called FAMER supporting the parallel linking and clustering of entities from multiple sources. The parallel execution of ER workflows is based on the Big Data framework Apache Flink. For entity resolution, FAMER first builds a similarity graph linking similar entities from all sources and then applies clustering to group together matching entities. For parallel clustering we currently support eight approaches that have been comprehensively evaluated for datasets from three domains. The evaluation showed that the clustering approaches CLIP, SplitMerge and Split (SplitMerge without merge phase) achieve a high match quality that is clearly superior to other previously known ER clustering schemes. In particular, they ensure source-consistent clusters with at most one entity per source as required for duplicate-free sources. Unfortunately, the current implementation of SplitMerge is expensive and not yet scalable to large datasets. By contrast, both Split and CLIP achieve high match quality as well as good execution times and scalability, making them good default schemes for multi-source entity clustering.
We are currently investigating performance optimizations of the SplitMerge algorithm to make it more scalable. We have also started to investigate incremental clustering approaches, where entity clusters are incrementally extended for new entities and new datasets [34]. We further plan to make the FAMER tool with the proposed clustering schemes publicly available and apply it in several applications, in particular to build large, high-quality knowledge graphs.

Figure 1. Overview of the FAMER approach for multi-source entity resolution

Figure 2. Applying FAMER to the data of Table 1.

Figure 5. Running example processed with CLIP clustering

Figure 6. Match quality of clustering-based ER approaches.

Table 1. Sample person entities from evaluation dataset DS3.