2023 · Preprint
DOI: 10.21203/rs.3.rs-1462001/v3
A large quantitative analysis of written language challenges the idea that all languages are equally complex

Abstract: One of the fundamental questions about human language is whether all languages are equally complex. Here, we approach this question from an information-theoretic perspective. We present a large-scale quantitative cross-linguistic analysis of written language by training a language model on more than 6,500 different documents as represented in 41 multilingual text collections consisting of ~3.5 billion words or ~9.0 billion characters and covering 2,069 different languages that are spoken as a native language b…

Cited by 1 publication (20 citation statements) · References 96 publications
“…faster) to learn: in a large-scale quantitative cross-linguistic analysis, ref. 25 trained an LM on more than 6,500 documents in over 2,000 different languages and statistically inferred the entropy rate of each document, which can be seen as an index of the underlying language complexity [28, 33–35]. The results showed that documents in languages with more speakers tended to be more complex.…”
(mentioning; confidence: 99%)
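The entropy rate inferred here is, in essence, the model's average per-symbol code length (cross-entropy) on a document: the harder a text is to predict, the more bits per character it costs, and the higher its estimated complexity. Below is a minimal sketch of that idea using a simple additively smoothed character n-gram model as an illustrative stand-in; the preprint's actual language model differs, and the function name and parameters here are assumptions, not the authors' implementation.

```python
import math
from collections import defaultdict

def entropy_rate_bits_per_char(train_text, test_text, order=3, alpha=0.1):
    """Estimate entropy rate (bits/character) of test_text under a
    character n-gram model fit on train_text. Illustrative only; the
    study's actual LM is different."""
    # Count how often each character follows each length-`order` context.
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set(train_text) | set(test_text)
    for i in range(len(train_text) - order):
        ctx = train_text[i:i + order]
        counts[ctx][train_text[i + order]] += 1

    total_bits = 0.0
    n = 0
    for i in range(order, len(test_text)):
        ctx = test_text[i - order:i]
        following = counts[ctx]
        total = sum(following.values())
        # Additive smoothing so unseen characters still get probability mass.
        p = (following[test_text[i]] + alpha) / (total + alpha * len(vocab))
        total_bits += -math.log2(p)  # code length of this character
        n += 1
    if n == 0:
        raise ValueError("test_text is shorter than the model order")
    return total_bits / n  # average code length = cross-entropy estimate
```

Lower values mean the model predicts the text more easily; comparing such per-document averages across languages is, in spirit, how a document's estimated entropy rate can serve as an index of language complexity.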
“…In this article, we first harness part of the data used by ref. 25 to explicitly test this hypothesis. Since the LM used by ref.…”
(mentioning; confidence: 99%)