CogNet: A Large-Scale Cognate Database

Batsuren, Khuyagbaatar; Bella, Gábor; Giunchiglia, Fausto

doi:10.18653/v1/p19-1302

Cited by 14 publications

(16 citation statements)

References 15 publications

(14 reference statements)


“…The resulting human-annotated dataset contained 8353 words, 62,752 pairs of cognate words and 587,357 pairs of non-cognate words. This set was significantly larger (by 80%) than the one we used in (Batsuren et al 2019a). We divided this dataset into two equal parts: the first 30 concepts for hyperparameter tuning ("tuning") and the second 30 concepts for evaluation ("test").…”

Section: Discussion (mentioning)

confidence: 97%
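
The concept-level split described in the statement above can be sketched as follows. This is a minimal illustration, not the authors' code: the `LabeledPair` record, the `split_by_concept` helper, and the placeholder concept names are all hypothetical, and the total of 60 concepts is inferred from "the first 30" and "the second 30" in the quote.

```python
from typing import List, NamedTuple, Sequence, Tuple


class LabeledPair(NamedTuple):
    """Hypothetical record for one annotated word pair."""
    concept: str      # concept the two words express
    word_a: str
    word_b: str
    is_cognate: bool  # human annotation


def split_by_concept(
    pairs: Sequence[LabeledPair],
    ordered_concepts: Sequence[str],
    n_tuning: int = 30,
) -> Tuple[List[LabeledPair], List[LabeledPair]]:
    """Send all pairs of the first n_tuning concepts to tuning, the rest to test."""
    tuning_concepts = set(ordered_concepts[:n_tuning])
    tuning = [p for p in pairs if p.concept in tuning_concepts]
    test = [p for p in pairs if p.concept not in tuning_concepts]
    return tuning, test


# Placeholder data: 60 made-up concept names, one dummy pair per concept.
concepts = [f"concept_{i:02d}" for i in range(60)]
pairs = [LabeledPair(c, f"{c}_a", f"{c}_b", True) for c in concepts]
tuning, test = split_by_concept(pairs, concepts)
```

Splitting by concept rather than by individual pair keeps every pair of a given concept on one side, so no concept seen during tuning reappears in the test half.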

“…This section describes how CogNet was evaluated on a diverse set of cognate corpora, and how its parameters were tuned to optimise results. With respect to the evaluation dataset used in Batsuren et al (2019a), we have considerably extended the evaluation corpus size, and we have also incorporated a pre-existing cognate database into our evaluations. The creation of self-annotated evaluation datasets despite the existence of cognate databases was desirable due to the latter being either phonetic (and thus not usable for our purposes) or limited to very few language pairs (as the resource described below).…”

Section: Discussion (mentioning)

confidence: 99%

“…• CogNet v0, a preliminary version, used the UKC as its input lexical DB and relied only on direct etymological evidence from the Etymological WordNet and transitivity. • CogNet v1, described in Batsuren et al (2019a), still relied solely on the UKC but also included indirect evidence (transliteration, orthographic, and geographic). • CogNet v2, presented in this paper, significantly increased its coverage thanks to extending the input lexical DB by about 800 thousand words retrieved from the PanLex resource.…”

Section: CogNet Resources and Their Evolution (mentioning)

confidence: 99%

“…With respect to an initial release of CogNet, introduced in (Batsuren et al, 2019a), this paper presents a method and a resource that have been greatly extended and redesigned for extensibility by new input resources. Consequently, the size of the CogNet database has been multiplied by 2.5 with respect to its first version.…”

Section: Introduction (mentioning)

confidence: 99%

A large and evolving cognate database

Batsuren

Bella

Giunchiglia

2021

Lang Resources & Evaluation

Self Cite

We present CogNet, a large-scale, automatically-built database of sense-tagged cognates—words of common origin and meaning across languages. CogNet is continuously evolving: its current version contains over 8 million cognate pairs over 338 languages and 35 writing systems, with new releases already in preparation. The paper presents the algorithm and input resources used for its computation, an evaluation of the result, as well as a quantitative analysis of cognate data leading to novel insights on language diversity. Furthermore, as an example on the use of large-scale cross-lingual knowledge bases for improving the quality of multilingual applications, we present a case study on the use of CogNet for bilingual lexicon induction in the framework of cross-lingual transfer learning.

“…As shown in Figure 1, we exploited a cognate database, CogNet (Batsuren et al, 2019, 2021), that has 8.1M cognate pairs, for evidence on cognacy: cog(w_A, w_B) = True is asserted by the presence of the word pair in CogNet.…”

Section: Derivation Enrichment (mentioning)

confidence: 99%
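
The cognacy test quoted above, where cog(w_A, w_B) is true exactly when the pair appears in the database, amounts to a membership lookup. The sketch below shows one way to write it, under stated assumptions: the `(language code, word form)` representation, the `build_index` and `cog` helpers, and the two toy entries are hypothetical and are not taken from CogNet's actual file format or contents.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical word representation: (language code, word form).
Word = Tuple[str, str]


def build_index(pairs: Iterable[Tuple[Word, Word]]) -> Set[FrozenSet[Word]]:
    """Store each pair as an unordered set so lookup ignores argument order."""
    return {frozenset((a, b)) for a, b in pairs}


def cog(index: Set[FrozenSet[Word]], w_a: Word, w_b: Word) -> bool:
    """Return True when the word pair is present in the index."""
    return frozenset((w_a, w_b)) in index


# Toy entries for illustration only, not drawn from the real database.
index = build_index([
    (("eng", "night"), ("deu", "Nacht")),
    (("eng", "mother"), ("deu", "Mutter")),
])

found = cog(index, ("deu", "Nacht"), ("eng", "night"))
missing = cog(index, ("eng", "night"), ("deu", "Mutter"))
```

Using `frozenset` keys makes the predicate symmetric, which matches the intuition that cognacy does not depend on the order of the two words. Note that absence from the index only means no evidence was found, not that the words are unrelated.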

MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology

Batsuren

Bella

Giunchiglia

2021

Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Self Cite

Large-scale morphological databases provide essential input to a wide range of NLP applications. Inflectional data is of particular importance for morphologically rich (agglutinative and highly inflecting) languages, and derivations can be used, e.g. to infer the semantics of out-of-vocabulary words. Extending the scope of state-of-the-art multilingual morphological databases, we announce the release of MorphyNet, a high-quality resource with 15 languages, 519k derivational and 10.1M inflectional entries, and a rich set of morphological features. MorphyNet was extracted from Wiktionary using both hand-crafted and automated methods, and was manually evaluated to be of a precision higher than 98%. Both the resource generation logic and the resulting database are made freely available and are reusable as stand-alone tools or in combination with existing resources.

A Database and Visualization of the Similarity of Contemporary Lexicons

Bella

Batsuren

Giunchiglia

2021

Text, Speech, and Dialogue

Self Cite

Lexical similarity data, quantifying the "proximity" of languages based on the similarity of their lexicons, has been increasingly used to estimate the cross-lingual reusability of language resources, for tasks such as bilingual lexicon induction or cross-lingual transfer. Existing similarity data, however, originates from the field of comparative linguistics, computed from very small expert-curated vocabularies that are not supposed to be representative of modern lexicons. We explore a different, fully automated approach to lexical similarity computation, based on an existing 8-million-entry cognate database created from online lexicons orders of magnitude larger than the word lists typically used in linguistics. We compare our results to earlier efforts, and automatically produce intuitive visualizations that have traditionally been hand-crafted. With a new, freely available database of over 27 thousand language pairs over 331 languages, we hope to provide more relevant data to cross-lingual NLP applications, as well as material for the synchronic study of contemporary lexicons.
