Are Automatic Methods for Cognate Detection Good Enough for Phylogenetic Reconstruction in Historical Linguistics?

Rama, Taraka; List, Johann‐Mattis; Wahle, Johannes; Jäger, Gerhard

doi:10.18653/v1/n18-2063

Cited by 37 publications

(33 citation statements)

References 43 publications


“…Automated cognate detection. There is a rich literature on developing automated cognate detection methods [28,29] for the purpose of detecting cognates and inferring phylogenetic trees [30,31]. The automated cognate detection methods compute a similarity between two words based on hand-crafted phonetic similarity measures [32,33], linear classifiers using word similarity scores [34,35] or phoneme n-grams as features for training [36,37] on hand-annotated training data, and neural networks [38].…”

Section: Methods (mentioning)

confidence: 99%
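To make the idea of a word-pair similarity score concrete, here is a minimal sketch using normalized edit distance. This is an assumption chosen for illustration: it is a plain string measure, not one of the hand-crafted phonetic measures or trained classifiers the statement cites, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of segments."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Illustrative word forms only; real systems work on phonetic transcriptions.
print(similarity("hand", "hant"))
```

A cognate detection pipeline would typically compute such scores for all word pairs within a concept and then cluster them; that step is not shown here.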

“…Recent research in computational historical linguistics [30,46] has shown that the trees inferred using cognates inferred from automated methods are as good as those inferred from expert annotated cognate judgments. We believe that the next area for application of these cognate detection methods is in linguistic dating, because the dating process has traditionally been heavily dependent on manual cognate detection, which is time consuming, potentially biased, and not yet available for most of the world's language families.…”

Section: Methods (mentioning)

confidence: 99%

“…We employ the automated cognate identification system described above [42] to assign cognate judgments to word lists. The choice of using such a system is supported by the results from two studies [30,46], which show that automatically inferred cognates can yield high quality phylogenetic trees. In addition to the cognate characters, we operate with sound class characters, which are extracted as follows.…”

Section: Methods (mentioning)

confidence: 99%


A test of Generalized Bayesian dating: A new linguistic dating method

Rama

Wichmann

2020

PLoS ONE

Self Cite


In current practice, when dating the root of a Bayesian language phylogeny the researcher is required to supply some of the information beforehand, including a distribution of root ages and dates for some nodes serving as calibration points. In addition to the potential subjectivity that this leaves room for, the problem arises that for many of the language families of the world there are no available internal calibration points. Here we address the following questions: Can a new Bayesian framework which overcomes these problems be introduced and how well does it perform? The new framework that we present is generalized in the sense that no family-specific priors or calibration points are needed. We moreover introduce a way to overcome another potential source of subjectivity in Bayesian tree inference as commonly practiced, namely that of manual cognate identification; instead, we apply an automated approach. Dates are obtained by fitting a Gamma regression model to tree lengths and known time depths for 30 phylogenetically independent calibration points. This model is used to predict the time depths of both the root and the internal nodes for 116 language families, producing a total of 1,287 dates for families and subgroups. It turns out that results are similar to those of published Bayesian studies of individual language families. The performance of the method is compared to automated glottochronology, which is an update of the classical method of Swadesh drawing upon automated cognate recognition and a new formula for deriving a time depth from percentages of shared cognates. It is also compared to a third dating method, that of the Automated Similarity Judgment Program (ASJP). In terms of errors and correlations with known dates, ASJP works better than the new method and both work better than automated glottochronology.
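The dating step described in the abstract, fitting a Gamma regression to tree lengths and known time depths, can be sketched roughly as follows. The abstract does not state the link function or any predictor transformation, so the log link, the single predictor, and the function names below are assumptions, and the numbers in the usage lines are made-up toy values rather than data from the study.

```python
import math


def fit_gamma_log_link(x, y, iterations=25):
    """Fit E[y] = exp(b0 + b1 * x) as a Gamma GLM by iteratively
    reweighted least squares (IRLS).

    With a Gamma response and a log link the IRLS weights are all 1,
    so each step is an ordinary least-squares fit to the working
    response z. All y values must be positive.
    """
    n = len(x)
    eta = [math.log(v) for v in y]  # start from the observed values
    b0 = b1 = 0.0
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    for _ in range(iterations):
        mu = [math.exp(e) for e in eta]
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        mean_z = sum(z) / n
        sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
        b1 = sxz / sxx
        b0 = mean_z - b1 * mean_x
        eta = [b0 + b1 * xi for xi in x]
    return b0, b1


def predict(b0, b1, x_new):
    """Predicted mean time depth for a new tree length."""
    return math.exp(b0 + b1 * x_new)


# Toy calibration points (hypothetical): tree lengths and known ages in years.
tree_lengths = [0.2, 0.5, 0.9, 1.4, 2.0]
known_ages = [900.0, 1600.0, 2500.0, 4200.0, 6500.0]
b0, b1 = fit_gamma_log_link(tree_lengths, known_ages)
print(predict(b0, b1, 1.0))
```

In practice one would use an established statistics package for this fit; the hand-written loop is only meant to show what the regression does.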


“…Phonetic similarity measures, however, require phonetic transcriptions to be a priori available. More recently, historical linguists have started exploiting identified cognates to infer phylogenetic relationships across languages (Rama et al., 2018; Jäger, 2018).…”

Section: State Of the Art (mentioning)

confidence: 99%

CogNet: A Large-Scale Cognate Database

Batsuren

Bella

Giunchiglia

2019

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics


This paper introduces CogNet, a new, large-scale lexical database that provides cognates (words of common origin and meaning) across languages. The database currently contains 3.1 million cognate pairs across 338 languages using 35 writing systems. The paper also describes the automated method by which cognates were computed from publicly available wordnets, with an accuracy evaluated at 94%. Finally, statistics and early insights about the cognate data are presented, hinting at a possible future exploitation of the resource by various fields of linguistics.


“…In a separate paper, Rama et al. (2018) presented pruned datasets for five different language families (Pama-Nyungan and Sino-Tibetan in addition to Austronesian, Austro-Asiatic, and Indo-European) consisting of only those languages that show the highest mutual lexical coverage. For each dataset, the authors pruned any language which has less than 75% mutual attestations with the rest of the languages.…”

Section: Effect Of Lexical Coverage (mentioning)

confidence: 99%
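The pruning step quoted above can be illustrated with a small sketch. The statement does not define how mutual attestation is computed, so the definition used here (the mean share of all concepts that a language attests jointly with each other language) is an assumption, as are the function names and the single-pass filtering.

```python
def mutual_coverage(wordlists):
    """Map each language to its mean share of jointly attested concepts.

    wordlists: dict mapping a language name to the set of concepts
    for which that language has at least one word.
    """
    concepts = set().union(*wordlists.values())
    coverage = {}
    for lang, attested in wordlists.items():
        others = [c for name, c in wordlists.items() if name != lang]
        if not others or not concepts:
            coverage[lang] = 1.0  # nothing to compare against
            continue
        shares = [len(attested & other) / len(concepts) for other in others]
        coverage[lang] = sum(shares) / len(shares)
    return coverage


def prune(wordlists, threshold=0.75):
    """Keep only languages whose mutual coverage reaches the threshold."""
    coverage = mutual_coverage(wordlists)
    return {lang: c for lang, c in wordlists.items()
            if coverage[lang] >= threshold}


# Toy example with hypothetical languages and concept IDs.
data = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 4},
    "C": {1, 2, 3, 4},
    "D": {1},
}
print(sorted(prune(data)))
```

This filters in one pass. Removing a sparse language changes the coverage of the remaining ones, so an iterative variant that recomputes coverage after each removal may be closer to what a real pipeline does.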

Similarity Dependent

Rama

2018

Proceedings of the 22nd Conference on Computational Natural Language Learning

Self Cite


We present and evaluate two similarity dependent Chinese Restaurant Process (sd-CRP) algorithms at the task of automated cognate detection. The sd-CRP clustering algorithms do not require any predefined threshold for detecting cognate sets in a multilingual word list. We evaluate the performance of the algorithms on six language families (more than 750 languages) and find that both sd-CRP variants perform as well as InfoMap and better than UPGMA at the task of inferring cognate clusters. The algorithms presented in this paper are family agnostic and can be applied to any linguistically under-studied language family.
