A cross-linguistic database of phonetic transcription systems

Anderson, Cormac; Tresoldi, Tiago; Chacon, Thiago Costa; Fehn, Anne-Maria; Walworth, Mary; Forkel, Robert; List, Johann‐Mattis

doi:10.2478/yplm-2018-0002

Cited by 37 publications

(29 citation statements)

References 18 publications

“…Along with the growing amount of digitally available data for the world's languages, we find a substantial increase in the application of new quantitative techniques. While most of the new methods are inspired by neighboring disciplines and general-purpose frameworks, such as evolutionary biology [1, 2], machine learning [3, 4], or statistical modeling [5, 6], the particularities of cross-linguistic data often necessitate a specific treatment of materials (reflected in recent standardization efforts [7, 8]) and methods (illustrated by the development of new algorithms tackling specifically linguistic problems [9, 10]).…”

Section: Background and Summary (mentioning)

confidence: 99%

The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies

et al. 2020

Self Cite

Advances in computer-assisted linguistic research have been greatly influential in reshaping linguistic research. With the increasing availability of interconnected datasets created and curated by researchers, more and more interwoven questions can now be investigated. Such advances, however, are bringing high requirements in terms of rigorousness for preparing and curating datasets. Here we present CLICS, a Database of Cross-Linguistic Colexifications (CLICS). CLICS tackles interconnected interdisciplinary research questions about the colexification of words across semantic categories in the world's languages, and show-cases best practices for preparing data for cross-linguistic research. This is done by addressing shortcomings of an earlier version of the database, CLICS2, and by supplying an updated version with CLICS3, which massively increases the size and scope of the project. We provide tools and guidelines for this purpose and discuss insights resulting from organizing student tasks for database updates.

“…First, to tokenize the data (split it up into sound segments), an orthography profile, as outlined in Wu et al (2020), was used by the Cross-Linguistic Data Formats Bench (Forkel and List, 2020) on the raw data. The CLDF Bench uses CLTS, or Cross-Linguistic Transcription Systems (Anderson et al, 2018), to consolidate transcriptions of words done by different linguists.…”

Section: Methods (mentioning)

confidence: 99%

Automated phylogeny of Palaung dialects

Lee

2021

KWPL

Improved methods in automatic cognate detection have recently been used by historical linguists to help determine the subgrouping of a clade of languages or dialects, capitalizing on the efficiency of computers when handling substantial amounts of data. In this paper, 16 Palaung dialects are examined using various methods of automatic cognate detection. Partial and whole cognate detection are used together with Lexstat and Sound Class Alignment to generate a phylogenetic tree of these dialects. The results of these methods are compared to the results using cognate detection decided by human experts (Deepadung et al., 2015). These results are substantially similar, suggesting that automatic cognate and phylogeny detection using algorithms is a viable complement to historical linguistic research. Accompanying this paper is a tutorial for the automated cognate detection and phylogeny procedure that was used. By following the steps, users can create results based on segmenting the morphemes of Palaung dialects differently.

“…Since translations may lack or one concept may have been represented by more than one word form, the resulting wordlists comprise between 956 and 2,558 word forms. While word forms were provided in orthographic form or phonological transcriptions in the original data, we added phonetic transcriptions which follow the unified Broad IPA transcription system proposed by the Cross-Linguistic Transcription Systems reference catalog [33, 34] with the help of orthography profiles [35] manually compiled by reading the relevant literature for each language. Orthography profiles can be best thought of as a specific look-up table, which allows to convert transcriptions from one orthography into another one (compare the presentation in Wu et al [36] for details); while such assisted transcription can introduce noise in the data, no comparable lexical database with transcriptions and loanword annotation was available.…”

Section: Methods (mentioning)

confidence: 99%

Using lexical language models to detect borrowings in monolingual wordlists

et al. 2020

Self Cite

Lexical borrowing, the transfer of words from one language to another, is one of the most frequent processes in language evolution. In order to detect borrowings, linguists make use of various strategies, combining evidence from various sources. Despite the increasing popularity of computational approaches in comparative linguistics, automated approaches to lexical borrowing detection are still in their infancy, disregarding many aspects of the evidence that is routinely considered by human experts. One example for this kind of evidence are phonological and phonotactic clues that are especially useful for the detection of recent borrowings that have not yet been adapted to the structure of their recipient languages. In this study, we test how these clues can be exploited in automated frameworks for borrowing detection. By modeling phonology and phonotactics with the support of Support Vector Machines, Markov models, and recurrent neural networks, we propose a framework for the supervised detection of borrowings in mono-lingual wordlists. Based on a substantially revised dataset in which lexical borrowings have been thoroughly annotated for 41 different languages from different families, featuring a large typological diversity, we use these models to conduct a series of experiments to investigate their performance in mono-lingual borrowing detection. While the general results appear largely unsatisfying at a first glance, further tests show that the performance of our models improves with increasing amounts of attested borrowings and in those cases where most borrowings were introduced by one donor language alone. Our results show that phonological and phonotactic clues derived from monolingual language data alone are often not sufficient to detect borrowings when using them in isolation. Based on our detailed findings, however, we express hope that they could prove to be useful in integrated approaches that take multi-lingual information into account.
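The citation statement from this paper describes orthography profiles as look-up tables that convert a transcription from one orthography into another. A minimal Python sketch of that idea follows, assuming a greedy longest-match strategy. The function name `tokenize`, the `PROFILE` table, and its grapheme-to-IPA mappings are hypothetical illustrations; they are not taken from the cited papers and are not the API of any existing tool.

```python
def tokenize(word, profile, missing="?"):
    """Segment `word` by greedy longest match against `profile`.

    `profile` maps source graphemes (possibly several characters long)
    to target segments. Characters not covered by the profile are
    replaced with `missing` so that gaps stay visible.
    """
    max_len = max(len(grapheme) for grapheme in profile)
    segments = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in profile:
                segments.append(profile[chunk])
                i += size
                break
        else:
            # No grapheme in the profile matches at this position.
            segments.append(missing)
            i += 1
    return segments


# Hypothetical profile for illustration only (not from the cited data).
PROFILE = {
    "tsch": "tʃ",
    "sch": "ʃ",
    "eu": "ɔɪ",
    "d": "d",
    "e": "ə",
    "u": "u",
    "t": "t",
    "s": "s",
}

print(tokenize("deutsch", PROFILE))
```

Matching the longest grapheme first keeps multi-character units such as "tsch" together instead of splitting them into "t" and "sch". Marking unmatched characters rather than dropping them makes it easier to spot where a manually compiled profile is incomplete.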
