A Comparison of String Similarity Measures for Toponym Matching

doi:10.1145/2534848.2534850

Cited by 19 publications

(18 citation statements)

References 28 publications

“…The meta-similarity proposed in [5] takes into account accentuation and other language-specific aspects of toponym names, in a four-stage process. The set of algorithms evaluated in [2] was assessed on the toponym interlinking problem by [12]. The authors experimented on place names listed in the GEOnet Names Server, which contains romanized toponyms from 11 different countries.…”

Section: Related Work

mentioning

confidence: 99%

Learning Advanced Similarities and Training Features for Toponym Interlinking

Giannopoulos

Kaffes

Kostoulas

2020

Lecture Notes in Computer Science

Interlinking of spatio-textual entities is an open and challenging research problem, with applications in several commercial fields, including geomarketing, navigation and social networks. It comprises the process of identifying, between different data sources, entity descriptions that refer to the same real-world entity. In this work, we focus on toponym interlinking; that is, we handle spatio-textual entities that are exclusively represented by their name, while additional properties, such as categories and coordinates, are considered either absent or of too low quality to be exploited in this setting. Toponyms are inherently heterogeneous entities; quite often several alternative names exist for the same toponym, with varying degrees of similarity between these names. State-of-the-art approaches mostly adopt generic, domain-agnostic similarity functions and use them as is, or incorporate them as training features within classifiers for performing toponym interlinking. We claim that capturing the specificities of toponyms and exploiting them in elaborate meta-similarity functions and derived training features can significantly increase the effectiveness of interlinking methods. To this end, we propose the LGM-Sim meta-similarity function and a series of novel similarity-based and statistical training features that can be utilized in similarity-based and classification-based interlinking settings, respectively. We demonstrate that the proposed methods achieve large increases in accuracy, in both settings, compared to several methods from the literature on the widely used Geonames toponym dataset.
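As an illustration of the "generic, domain-agnostic similarity functions" used "as training features" that the abstract refers to, the following is a minimal Python sketch. It is not LGM-Sim and not any cited author's method. The function names (`normalize`, `edit_similarity`, `bigram_jaccard`, `pair_features`), the choice of three measures, and the accent-stripping normalization are all assumptions made for this example.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and strip diacritics (an illustrative normalization choice)."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def bigram_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the sets of character bigrams."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a or not grams_b:
        # Strings shorter than two characters have no bigrams.
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def pair_features(name_a: str, name_b: str) -> list[float]:
    """Hypothetical feature vector for one candidate toponym pair."""
    a, b = normalize(name_a), normalize(name_b)
    return [
        edit_similarity(a, b),
        bigram_jaccard(a, b),
        SequenceMatcher(None, a, b).ratio(),
    ]
```

A vector such as `pair_features("Zürich", "Zurich")` could be thresholded directly (the similarity-based setting) or passed to a classifier together with a match/non-match label (the classification-based setting).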

Section: Related Work

mentioning

confidence: 99%

“…We note that the baselines we compare with cover a large part of the presented literature, presented in the following papers: [2, 3, 5, 12, 14].…”

mentioning

confidence: 99%

Learning Advanced Similarities and Training Features for Toponym Interlinking

Giannopoulos

Kaffes

Kostoulas

2020

Lecture Notes in Computer Science

“…Comparison of models and algorithms used for highlighting requires arrays of similar strings of various origins [1], which usually come from either unpublished personal data arrays [2][3][4][5], or from hand-annotated linguistic corpora or thesauri, as in [6], or from artificially generated data [7]. The public unavailability of some precludes the reproducibility of experiments and an independent assessment of the quality of the initial data, while the labor-intensive nature of others also limits their volume and availability.…”

Section: Introduction

mentioning

confidence: 99%

“…Values of errors of metrics in the group (4) of the language pairs ({de, hu}, {en, hu}, {eo, hu}, {fr, hu})

            8% ± 2.4%     29.5% ± 1.0%  30.4% ± 1.4%  31.2% ± 2.3%
lcs         33.2% ± 2.4%  30.8% ± 1.3%  32.2% ± 1.8%  32.3% ± 2.2%
qgram3      33.4% ± 3.3%  28.9% ± 0.6%  33.0% ± 1.5%  32.3% ± 2.9%
dl,lv,osa   34.3% ± 4.3%  32.6% ± 1.0%  33.5% ± 1.1%  33.6% ± 2.9%
cosine3     35.1% ± 3.1%  31.9% ± 0.6%  35.6% ± 1.4%  34.7% ± 2.6%
AVERAGE     36.3% ± 2.2%  33.6% ± 1.0%  35.6% ± 0.9%  35.5% ± 1.9%
qgram2      37.0% ± 3.2%  34.7% ± 0.8%  37.8% ± 1.9%  36.8% ± 2.6%
qgram1      39.6% ± 3.4%  37.4% ± 1.0%  39.9% ± 1.3%  39.3% ± 2.5%
jwp         41.0% ± 2.3%  36.3% ± 1.3%  39.6% ± 1.4%  39.5% ± 2.5%
jw          41.1% ± 2.4%  36.9% ± 1.2%  40.0% ± 1.4%  39.8% ± 2.4%
cosine2     40.8% ± 3.5%  38.5% ± 0.8%  41.2% ± 1.0%  40.5% ± 2.5%
cosine1     43.1% ± 3.4%  43.4% ± 1.1%  46.0% ± 0.8%  44.3% ± 2.7%…”

mentioning

confidence: 99%

Stable assessment of the quality of similarity algorithms of character strings and their normalizations

Znamenskij

2018

ПСТП

The choice of tools for finding hidden commonality in data of a new nature requires stable and reproducible comparative assessments of the quality of abstract algorithms for the proximity of symbol strings. Conventional estimates based on artificially generated or manually labeled tests vary significantly, evaluating the method of artificial generation more reliably than the similarity algorithms themselves, and estimates based on user data cannot be accurately reproduced. A simple, transparent, objective and reproducible numerical quality assessment of a string metric is proposed. Parallel texts of book translations in different languages are used. The quality of a measure is estimated by the percentage of errors over the possible distinct attempts to determine the translation of a given paragraph among two paragraphs of the book in another language, one of which is actually a translation. The stability of the assessments is verified by their independence from the choice of book and language pair. The numerical experiment stably ranked abstract character-string comparison algorithms by quality and showed a strong dependence on the choice of normalization.
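The evaluation protocol described in this abstract can be sketched roughly as follows. This is a simplified reading of the abstract, not the author's implementation. The function names, the use of `difflib.SequenceMatcher` as a stand-in similarity measure, the assumption that the two books are already aligned paragraph by paragraph, and the decision to count ties as errors are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import permutations
from typing import Callable, Sequence


def ratio_similarity(a: str, b: str) -> float:
    """Placeholder string similarity; any measure under test can be swapped in."""
    return SequenceMatcher(None, a, b).ratio()


def error_rate(
    source: Sequence[str],
    target: Sequence[str],
    similarity: Callable[[str, str], float] = ratio_similarity,
) -> float:
    """Fraction of tries in which the measure fails to pick the true translation.

    source[i] and target[i] are assumed to be the same paragraph in two
    languages. Each try pairs source[i] with its true translation target[i]
    and one distractor target[j], j != i.
    """
    if len(source) != len(target):
        raise ValueError("source and target must be aligned paragraph lists")
    tries = 0
    errors = 0
    for i, j in permutations(range(len(source)), 2):
        tries += 1
        true_score = similarity(source[i], target[i])
        distractor_score = similarity(source[i], target[j])
        # Assumption: a tie counts as an error, since the measure
        # does not prefer the true translation.
        if true_score <= distractor_score:
            errors += 1
    if tries == 0:
        raise ValueError("at least two paragraphs are needed")
    return errors / tries
```

Under this reading, a measure that cannot tell the two candidates apart scores near 50% with random tie-breaking, so lower percentages indicate a more useful measure.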

“…Comparing the models and algorithms used for highlighting requires arrays of similar strings of various origins [1]; these are usually either publicly unavailable arrays of personal data [2][3][4][5], or manually annotated linguistic corpora or thesauri, as in [6], or sometimes artificially generated data [7]. The closed nature of some rules out reproducibility of experiments and independent assessment of the quality of the source data, while the high labor cost of others also limits their size and availability.…”

Section: Introduction

unclassified

Stable assessment of the quality of similarity algorithms of character strings and their normalizations

Znamenskii

2018

ПСТП

The choice of tools for finding hidden commonality in data of a new nature requires stable and reproducible comparative assessments of the quality of abstract string-similarity algorithms. Conventional assessments based on artificially generated or manually labeled tests differ substantially, evaluating the method of artificial generation more reliably than the similarity algorithms themselves, and assessments based on user data cannot be reproduced exactly. A simple, transparent, objective and reproducible numerical assessment of the quality of a string metric is proposed. Parallel texts of book translations into different languages are used. The quality of a measure is estimated by the percentage of errors over the possible distinct attempts to identify the translation of a given paragraph among two paragraphs of the book in another language, one of which is indeed the translation. The stability of the assessments is verified by their independence from the choice of book and language pair. The numerical experiment stably ranked abstract character-string comparison algorithms by quality and showed a strong dependence on the choice of normalization.
