Proceedings of the First ACM SIGSPATIAL International Workshop on Computational Models of Place 2013
DOI: 10.1145/2534848.2534850

A Comparison of String Similarity Measures for Toponym Matching

Abstract: The diversity of ways in which toponyms are specified often results in mismatches between queries and the place names contained in gazetteers. Search terms that include unofficial variants of official place names, unanticipated transliterations, and typos are frequently similar but not identical to the place names contained in the gazetteer. String similarity measures can mitigate this problem, but given their task-dependent performance, the optimal choice of measure is unclear. We constructed a task in which …
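As a rough illustration of the idea in the abstract, the sketch below ranks gazetteer entries against a misspelled query using a normalized Levenshtein similarity. It is only one of many possible measures and is not claimed to be the one the paper recommends; the function names (`levenshtein`, `similarity`, `best_match`) and the three-entry gazetteer are hypothetical, chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical after case folding."""
    a, b = a.casefold(), b.casefold()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def best_match(query: str, gazetteer: list[str]) -> tuple[str, float]:
    """Return the gazetteer entry most similar to the query, with its score."""
    return max(((name, similarity(query, name)) for name in gazetteer),
               key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical gazetteer; the query drops the final "h" of the official name.
    print(best_match("Pittsburg", ["Pittsburgh", "Petersburg", "Harrisburg"]))
```

A different measure (for example a q-gram or Jaro-Winkler similarity) could be swapped in for `similarity` without changing `best_match`, which is what makes side-by-side comparisons of measures straightforward.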

Cited by 19 publications (18 citation statements)
References: 28 publications
“…The metasimilarity proposed in [5] takes into account accentuation and other language-specific aspects of toponym names, in a four-stage process. The set of algorithms evaluated in [2] was assessed on the toponym interlinking problem by [12]. The authors experimented on place names listed in the GEOnet Names Server, which contains romanized toponyms from 11 different countries.…”
Section: Related Work
mentioning
confidence: 99%
“…We note that the baselines we compare with cover a large part of the literature presented in the following papers: [2, 3, 5, 12, 14].…”
mentioning
confidence: 99%
“…Comparison of models and algorithms used for highlighting requires arrays of similar strings of various origins [1], which usually come either from unpublished personal data arrays [2][3][4][5], or from hand-marked linguistic corpora or thesauri, as in [6], or from artificially generated data [7]. The public unavailability of the former rules out reproducibility of experiments and independent assessment of the quality of the initial data, while the labor-intensive nature of the latter also limits their volume and availability.…”
Section: Introduction
mentioning
confidence: 99%
“…Values of errors of metrics in the group (4) of the language pairs ({de, hu}, {en, hu}, {eo, hu}, {fr, hu})

            8% ± 2.4%      29.5% ± 1.0%   30.4% ± 1.4%   31.2% ± 2.3%
lcs         33.2% ± 2.4%   30.8% ± 1.3%   32.2% ± 1.8%   32.3% ± 2.2%
qgram3      33.4% ± 3.3%   28.9% ± 0.6%   33.0% ± 1.5%   32.3% ± 2.9%
dl,lv,osa   34.3% ± 4.3%   32.6% ± 1.0%   33.5% ± 1.1%   33.6% ± 2.9%
cosine3     35.1% ± 3.1%   31.9% ± 0.6%   35.6% ± 1.4%   34.7% ± 2.6%
AVERAGE     36.3% ± 2.2%   33.6% ± 1.0%   35.6% ± 0.9%   35.5% ± 1.9%
qgram2      37.0% ± 3.2%   34.7% ± 0.8%   37.8% ± 1.9%   36.8% ± 2.6%
qgram1      39.6% ± 3.4%   37.4% ± 1.0%   39.9% ± 1.3%   39.3% ± 2.5%
jwp         41.0% ± 2.3%   36.3% ± 1.3%   39.6% ± 1.4%   39.5% ± 2.5%
jw          41.1% ± 2.4%   36.9% ± 1.2%   40.0% ± 1.4%   39.8% ± 2.4%
cosine2     40.8% ± 3.5%   38.5% ± 0.8%   41.2% ± 1.0%   40.5% ± 2.5%
cosine1     43.1% ± 3.4%   43.4% ± 1.1%   46.0% ± 0.8%   44.3% ± 2.7%…”
mentioning
confidence: 99%
“…Comparing the models and algorithms used for highlighting requires arrays of similar strings of various origins [1]; these are usually either publicly unavailable arrays of personal data [2][3][4][5], or manually annotated linguistic corpora or thesauri, as in [6], and sometimes artificially generated data [7]. The closed nature of the former rules out reproducible experiments and independent assessment of the quality of the source data, while the high labor cost of the latter likewise limits their volume and availability.…”
Section: Introduction
unclassified