“…In this section, we keep our QuadSky steps but replace the labeling of the pairs with a supervised learning technique. We decided to compare the SkyEx-* family of algorithms with logistic regression [43], support vector machines (SVM) [44], decision trees [45], and Naive Bayes [46], which are supervised learning techniques commonly used in entity resolution problems [8,10,25,33,47]. We applied these methods on D full pairs that are at most 30 meters apart (dataset description in Table 3).…”
Section: Comparison With Supervised Learning Techniques (mentioning)
confidence: 99%
“…What is more, the methods propose arbitrarily attribute weights and score functions without experimentation nor evaluation. In contrast to [11][12][13], the skyline-based algorithm (SkyEx) proposed in [10] is free of scoring functions and semi-arbitrary weights, and achieves good results. However, SkyEx is dependent on a threshold number of skylines k, which can only be discovered through experiments, as the authors do not provide methods for estimating k. To sum up, on the one hand, there is a growing amount of information about spatial entities, both within a single source and across sources, which can improve the quality of the geo-information; on the other hand, the spatial entity linkage problem is hard to resolve not only because of the heterogeneity of the data but also because of the lack of appropriate and effective methods.…”
Besides the traditional cartographic data sources, spatial information can also be derived from location-based sources. However, even though different location-based sources refer to the same physical world, each one has only partial coverage of the spatial entities, describes them with different attributes, and sometimes provides contradictory information. Hence, we introduce the spatial entity linkage problem, which finds which pairs of spatial entities belong to the same physical spatial entity. Our proposed solution (QuadSky) starts with a time-efficient spatial blocking technique (QuadFlex), compares the spatial entities in the same block pairwise, ranks the pairs using Pareto optimality with the SkyRank algorithm, and finally classifies the pairs with our novel SkyEx-* family of algorithms, which yield 0.85 precision and 0.85 recall for a manually labeled dataset of 1,500 pairs and 0.87 precision and 0.6 recall for a semi-manually labeled dataset of 777,452 pairs. Moreover, we provide a theoretical guarantee and formalize the SkyEx-FES algorithm, which explores only 27% of the skylines without any loss in F-measure. Furthermore, our fully unsupervised algorithm SkyEx-D approximates the optimal result with an F-measure loss of just 0.01. Finally, QuadSky provides the best trade-off between precision and recall and the best F-measure compared to the existing baselines and clustering techniques, and it approximates the results of supervised learning solutions.
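To make the idea of ranking compared pairs by Pareto optimality concrete, the following is a minimal sketch of generic skyline layering, not the authors' SkyRank or SkyEx-* implementation. It assumes each candidate pair is described by a tuple of similarity scores where higher is better; the names `dominates`, `skyline_levels`, and `label_top_k`, the sample scores, and the cut-off `k` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple

Vector = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is at least as good as b in every dimension
    and strictly better in at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline_levels(points: Dict[str, Vector]) -> List[List[str]]:
    """Peel off successive Pareto fronts; index 0 is the first skyline."""
    remaining = dict(points)
    levels: List[List[str]] = []
    while remaining:
        # A pair is on the current skyline if no remaining pair dominates it.
        front = sorted(
            key
            for key, vec in remaining.items()
            if not any(dominates(other, vec) for other in remaining.values())
        )
        levels.append(front)
        for key in front:
            del remaining[key]
    return levels


def label_top_k(levels: List[List[str]], k: int) -> Set[str]:
    """Label the pairs in the first k skylines as matches (k is an assumed cut-off)."""
    return {key for level in levels[:k] for key in level}


# Hypothetical pairs scored on two assumed dimensions (e.g. name and address similarity).
pairs = {
    "p1": (0.9, 0.8),
    "p2": (0.7, 0.9),
    "p3": (0.6, 0.6),
    "p4": (0.2, 0.1),
    "p5": (0.5, 0.7),
}
levels = skyline_levels(pairs)
matches = label_top_k(levels, k=1)
```

In this sketch, no weights or scoring function are needed: a pair ranks higher simply because fewer layers of other pairs dominate it. The quadratic dominance check is kept for readability and would not scale to hundreds of thousands of pairs as written.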
“…Similar synonyms describing the same problem have continuously appeared in the literature such as deduplication, entity resolution, entity matching, record linkage [63,65]. The entities that are matched can be of various fields, for example, profiles in social networks belonging to the same individual [56,66], bioinformatics data [67], biomedical data [68], publication data of the same author [65,69], genealogical data to find the human entities [70], records of the same product [65,69], etc. Regardless the field, the entity linkage follows, in principle, three main steps: blocking, entity comparison, and pair labeling [54,71] (Fig.…”
Section: Entity Linkage (mentioning)
confidence: 92%
“…The user-based crawling navigates the geo-social data source using users as query parameters. The most popular method mentioned in several papers is Snowball [15,21,56]. Snowball starts with some initial users, known as the seed.…”
Section: User-based Crawling (mentioning)
confidence: 99%
“…The spatial entities have attributes such as name, categories, etc. For measuring these similarities, the traditional similarity metrics such as Levenshtein, Jaccard, Cosine [56,80,83,91] show good results. However, more advanced metrics can better capture the similarity of two entities such as a Soft-TFIDF with Levenshtein [92], and traditional string similarities trained further with supervised machine learning [93].…”
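As an illustration of two of the traditional metrics named in the passage above, here is a minimal sketch of Levenshtein distance (normalized into a similarity) and Jaccard similarity over token sets. The function names, the normalization by the longer string's length, and the convention that two empty inputs count as identical are assumptions made for this example, not details taken from the cited work.

```python
from typing import Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the distance into [0, 1]; two empty strings are treated as identical (assumption)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union; two empty sets give 1.0 (assumption)."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


# Hypothetical attribute values for two spatial entities.
name_similarity = levenshtein_similarity("Cafe Central", "Café Central")
category_similarity = jaccard({"cafe", "bar"}, {"bar", "restaurant"})
```

A character-level metric such as Levenshtein suits short free-text attributes like names, whereas a set-based metric such as Jaccard suits multi-valued attributes like categories.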
Besides the traditional cartographic data sources, spatial information can also be derived from location-based sources. However, even though different location-based sources refer to the same physical world, each one has only partial coverage of the spatial entities, describes them with different attributes, and sometimes provides contradictory information. Hence, we introduce the spatial entity linkage problem, which finds which pairs of spatial entities belong to the same physical spatial entity. Our proposed solution (QuadSky) starts with a spatial blocking technique (QuadFlex) that creates blocks of nearby spatial entities with the time complexity of the quadtree algorithm. After pairwise comparing the spatial entities in the same block, we propose the SkyRank algorithm, which ranks the compared pairs using Pareto optimality. We introduce the SkyEx-* family of algorithms, which can classify the pairs with 0.85 precision and 0.85 recall for a manually labeled dataset of 1,500 pairs and 0.87 precision and 0.6 recall for a semi-manually labeled dataset of 777,452 pairs. Moreover, our fully unsupervised algorithm SkyEx-D approximates the optimal result with an F-measure loss of just 0.01. Finally, QuadSky provides the best trade-off between precision and recall and the best F-measure compared to the existing baselines.