2023 · Preprint
DOI: 10.21203/rs.3.rs-1462001/v3
A large quantitative analysis of written language challenges the idea that all languages are equally complex

Abstract: One of the fundamental questions about human language is whether all languages are equally complex. Here, we approach this question from an information-theoretic perspective. We present a large-scale quantitative cross-linguistic analysis of written language by training a language model on more than 6,500 different documents as represented in 41 multilingual text collections consisting of ~3.5 billion words or ~9.0 billion characters and covering 2,069 different languages that are spoken as a native language b…

Cited by 1 publication (20 citation statements) · References 96 publications
“…faster) to learn: in a large-scale quantitative cross-linguistic analysis, ref. 25 trained an LM on more than 6,500 documents in over 2,000 different languages and statistically inferred the entropy rate of each document, which can be seen as an index of the underlying language complexity [28, 33–35]. The results showed that documents in languages with more speakers tended to be more complex.…”
(mentioning; confidence: 99%)
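The entropy rate inferred here is, in essence, the model's average per-symbol code length (cross-entropy) on a document: the harder a text is to predict, the more bits per character it costs, and the higher its estimated complexity. Below is a minimal sketch of that idea using a simple additively smoothed character n-gram model as an illustrative stand-in; the preprint's actual language model differs, and the function name and parameters here are assumptions, not the authors' implementation.

```python
import math
from collections import defaultdict

def entropy_rate_bits_per_char(train_text, test_text, order=3, alpha=0.1):
    """Estimate entropy rate (bits/character) of test_text under a
    character n-gram model fit on train_text. Illustrative only; the
    study's actual LM is different."""
    # Count how often each character follows each length-`order` context.
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set(train_text) | set(test_text)
    for i in range(len(train_text) - order):
        ctx = train_text[i:i + order]
        counts[ctx][train_text[i + order]] += 1

    total_bits = 0.0
    n = 0
    for i in range(order, len(test_text)):
        ctx = test_text[i - order:i]
        following = counts[ctx]
        total = sum(following.values())
        # Additive smoothing so unseen characters still get probability mass.
        p = (following[test_text[i]] + alpha) / (total + alpha * len(vocab))
        total_bits += -math.log2(p)  # code length of this character
        n += 1
    if n == 0:
        raise ValueError("test_text is shorter than the model order")
    return total_bits / n  # average code length = cross-entropy estimate
```

Lower values mean the model predicts the text more easily; comparing such per-document averages across languages is, in spirit, how a document's estimated entropy rate can serve as an index of language complexity.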
“…In this article, we first harness part of the data used by ref. 25 to explicitly test this hypothesis. Since the LM used by ref.…”
(mentioning; confidence: 99%)