Global Syntactic Variation in Seven Languages: Toward a Computational Dialectology

Dunn, Jonathan

doi:10.3389/frai.2019.00015

Cited by 17 publications

(11 citation statements)

References 54 publications

(75 reference statements)


“…The inverse of this generalization is that individuals have unique or idiosyncratic constructions which are only revealed when the training corpus is centered around that individual. This finding fits well with studies in variation (Dunn, 2019a, 2019b), which reveal the high degree of syntactic differences across speech communities.…”

Section: Experiments 3 Perception Vs Production In Grammar Similarity

supporting

confidence: 90%

Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

2021


Word concreteness and imageability have proven crucial in understanding how humans process and represent language in the brain. While word-embeddings do not explicitly incorporate the concreteness of words into their computations, they have been shown to accurately predict human judgments of concreteness and imageability. Inspired by the recent interest in using neural activity patterns to analyze distributed meaning representations, we first show that brain responses acquired while human subjects passively comprehend natural stories can significantly distinguish the concreteness levels of the words encountered. We then examine for the same task whether the additional perceptual information in the brain representations can complement the contextual information in the word-embeddings. However, the results of our predictive models and residual analyses indicate the contrary. We find that the relevant information in the brain representations is a subset of the relevant information in the contextualized word-embeddings, providing new insight into the existing state of natural language processing models.


“…Variation within and between both datasets is structured more around individual languages and is less predictable given country-specific population and corpus size information. Work based on previous versions of the corpus (Dunn, 2019a, 2019b) have shown that meaningful dialectal variation can be modeled using this source of data. The internal (corpus similarity) and external (demographic) evaluations in this paper strongly suggest that future work based on these expanded country-language sub-corpora will support further advances in corpus-based dialectology.…”

Section: Discussion

mentioning

confidence: 99%

Mapping languages: the Corpus of Global Language Use

Dunn

2020

Lang Resources & Evaluation

Self Cite


This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used (e.g., English, Arabic, Russian) together with consistently collected data for each variety. Second, the paper evaluates a language identification model that supports more local languages with smaller sample sizes than alternative off-the-shelf models. Improved language identification is essential for moving beyond majority languages. Given the focus on language mapping, the paper analyzes how well this digital language data represents actual populations by (i) systematically comparing the corpus with demographic ground-truth data and (ii) triangulating the corpus with an alternate Twitter-based dataset. In total, the corpus contains 423 billion words representing 148 languages (with over 1 million words from each language) and 158 countries (again with over 1 million words from each country), all distilled from Common Crawl web data. The main contribution of this paper, in addition to describing this publicly-available corpus, is to provide a comprehensive analysis of the relationship between two sources of digital data (the web and Twitter) as well as their connection to underlying populations.


“…The grammar induction algorithm used here employs an association-based beam search to identify the best sequences of slot-constraints (Dunn, 2019a). While a grammar formalism like dependency grammar (Nivre and McDonald, 2008;Zhang and Nivre, 2012) must identify the head and attachment type for each word, a construction grammar must identify the representation type for each slot-constraint.…”

Section: Methods: Computational Cxg

mentioning

confidence: 99%

“…However, because these two types of representations operate at different levels of complexity, it is possible that they grow at different rates. We thus experiment with the growth of a computational construction grammar (Dunn, 2018b, 2019a) across data drawn from six different registers: news articles, Wikipedia articles, web pages, tweets, academic papers, and published books. These experiments are needed to establish a baseline relationship between the grammar and the lexicon for the experiments to follow.…”

mentioning

confidence: 99%

Production vs Perception: The Role of Individuality in Usage-Based Grammar Induction

Dunn, Nini

2021

Preprint

Self Cite


This paper asks whether a distinction between production-based and perception-based grammar induction influences either (i) the growth curve of grammars and lexicons or (ii) the similarity between representations learned from independent sub-sets of a corpus. A production-based model is trained on the usage of a single individual, thus simulating the grammatical knowledge of a single speaker. A perception-based model is trained on an aggregation of many individuals, thus simulating grammatical generalizations learned from exposure to many different speakers. To ensure robustness, the experiments are replicated across two registers of written English, with four additional registers reserved as a control. A set of three computational experiments shows that production-based grammars are significantly different from perception-based grammars across all conditions, with a steeper growth curve that can be explained by substantial inter-individual grammatical differences.

The Role of Individuals in Usage-Based Grammar Induction

This paper experiments with the interaction between the amount of exposure (the size of a training corpus) and the number of representations learned (the size of the grammar and lexicon) under perception-based vs production-based grammar induction. The basic idea behind these experiments is to test the degree to which computational con-
