This paper makes two contributions. First, we introduce a new method for computational cladistics that produces a rooted tree by minimising the number of homoplasies. This method is compared with lexicostatistics and maximum parsimony. We validate the method on Indo-European data and show that the derived tree is consistent with the current understanding of the internal cladistics of that family. Second, we apply the method to the less well-studied problem of the internal cladistics of Afro-Asiatic. We show that there is good evidence for a North/South division in Afro-Asiatic, with Berber, Egyptian and Semitic in the North and Chadic, Cushitic and Omotic in the South. There is also tentative evidence for a further grouping of Egyptian with Semitic and of Cushitic with Omotic.
INTRODUCTION
The objective of this paper is to introduce a new method for computational cladistics and to show the results of that method on an Afro-Asiatic dataset. The goal of cladistics is to correctly identify clades, defined as ancestral speech communities identified by sharing some linguistic innovation (Hennig, 1966). Crucially, a clade is formed when a group of speakers departs from its neighbouring communities by developing a linguistic innovation (e.g., a sound change, morphological reorganization or lexical innovation). Ideally, these clades correspond to historical speech communities, which continued to sub-divide, leading to the modern situation of languages belonging to larger language families. This continual sub-division of language families gives rise to cladograms, which represent how the attested languages cluster into larger and larger clades.
In many cases, the larger cladistic units are already identified (e.g., Proto-Indo-European). However, the internal cladistics of the language family remain debatable, that is, how the various languages cluster within their larger family. The implications of these clades are not only of linguistic importance, but are also relevant for the study of historic and pre-historic population movements (or at least cultural diffusion). Since clades represent historical speech communities, the sub-division of clades in the cladogram reflects the dispersion of a once unified speech community at some point in the past. Thus, linguistic cladistics informs and can be informed by the archaeological record (Blench 2001; Bouckaert et al. 2012; Anthony and Ringe 2015; Chang et al. 2015).
This paper presents a novel method for identifying clades computationally from a database containing lists of linguistic features from related languages. We start by giving a more in-depth background on the assumptions that we make about cladistics, and about computational cladistics in particular.
We then discuss previous computational cladistic