Advances in computer-assisted linguistic research have been greatly influential in reshaping linguistic research. With the increasing availability of interconnected datasets created and curated by researchers, more and more interwoven questions can now be investigated. Such advances, however, are bringing high requirements in terms of rigorousness for preparing and curating datasets. Here we present CLICS, a Database of Cross-Linguistic Colexifications (CLICS). CLICS tackles interconnected interdisciplinary research questions about the colexification of words across semantic categories in the world's languages, and show-cases best practices for preparing data for cross-linguistic research. This is done by addressing shortcomings of an earlier version of the database, CLICS2, and by supplying an updated version with CLICS3, which massively increases the size and scope of the project. We provide tools and guidelines for this purpose and discuss insights resulting from organizing student tasks for database updates.
Abstract:The Database of Cross-Linguistic Colexifications (CLICS), has established a computer-assisted framework for the interactive representation of crosslinguistic colexification patterns. In its current form, it has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations, ranging from studies on semantic change, patterns of conceptualization, and linguistic paleontology. But CLICS has also been criticized for obvious shortcomings, ranging from the underlying dataset, which still contains many errors, up to the limits of cross-linguistic colexification studies in general. Building on recent standardization efforts reflected in the Cross-Linguistic Data Formats initiative (CLDF) and novel approaches for fast, efficient, and reliable data aggregation, we have created a new database for cross-linguistic colexifications, which not only supersedes the original CLICS database in terms of coverage but also offers a much more principled procedure for the creation, curation and aggregation of datasets. The paper presents the new database and discusses its major features.
Contrary to what non-practitioners might expect, the systems of phonetic notation used by linguists are highly idiosyncratic. Not only do various linguistic subfields disagree on the specific symbols they use to denote the speech sounds of languages, but also in large databases of sound inventories considerable variation can be found. Inspired by recent efforts to link cross-linguistic data with help of reference catalogues (Glottolog, Concepticon) across different resources, we present initial efforts to link different phonetic notation systems to a catalogue of speech sounds. This is achieved with the help of a database accompanied by a software framework that uses a limited but easily extendable set of non-binary feature values to allow for quick and convenient registration of different transcription systems, while at the same time linking to additional datasets with restricted inventories. Linking different transcription systems enables us to conveniently translate between different phonetic transcription systems, while linking sounds to databases allows users quick access to various kinds of metadata, including feature values, statistics on phoneme inventories, and information on prosody and sound classes. In order to prove the feasibility of this enterprise, we supplement an initial version of our cross-linguistic database of phonetic transcription systems (CLTS), which currently registers five transcription systems and links to fifteen datasets, as well as a web application, which permits users to conveniently test the power of the automatic translation across transcription systems.
Lexical borrowing, the transfer of words from one language to another, is one of the most frequent processes in language evolution. In order to detect borrowings, linguists make use of various strategies, combining evidence from various sources. Despite the increasing popularity of computational approaches in comparative linguistics, automated approaches to lexical borrowing detection are still in their infancy, disregarding many aspects of the evidence that is routinely considered by human experts. One example for this kind of evidence are phonological and phonotactic clues that are especially useful for the detection of recent borrowings that have not yet been adapted to the structure of their recipient languages. In this study, we test how these clues can be exploited in automated frameworks for borrowing detection. By modeling phonology and phonotactics with the support of Support Vector Machines, Markov models, and recurrent neural networks, we propose a framework for the supervised detection of borrowings in mono-lingual wordlists. Based on a substantially revised dataset in which lexical borrowings have been thoroughly annotated for 41 different languages from different families, featuring a large typological diversity, we use these models to conduct a series of experiments to investigate their performance in mono-lingual borrowing detection. While the general results appear largely unsatisfying at a first glance, further tests show that the performance of our models improves with increasing amounts of attested borrowings and in those cases where most borrowings were introduced by one donor language alone. Our results show that phonological and phonotactic clues derived from monolingual language data alone are often not sufficient to detect borrowings when using them in isolation. Based on our detailed findings, however, we express hope that they could prove to be useful in integrated approaches that take multi-lingual information into account.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.