Same Neurons, Different Languages: Probing Morphosyntax in Multilingual Pre-trained Models

Stanczak, Karolina; Ponti, Edoardo Maria; Hennigen, Lucas Torroba; Cotterell, Ryan; Augenstein, Isabelle

doi:10.18653/v1/2022.naacl-main.114

Cited by 9 publications

(9 citation statements)

References 14 publications

“…We speculate that this low number of values leads to low variation among languages, thus the non-significant difference. This finding concurs with Stanczak et al (2022), who observed a negative correlation between the number of values per morphosyntactic category and the proportion of language pairs with significant neuron overlap. Hence, the lack of significant differences in variance between the diverse and related sets can be attributed to the substantial overlap of neurons across language pairs.…”

Section: Language Proximity and Low-resource Conditions (supporting)

confidence: 91%

“…This observation holds true for all categories, with the exception of Animacy, which is predominantly found in Slavic languages within our dataset. This aligns with the findings of Stanczak et al (2022), who noted that the correlation analysis results can be affected by whether a category is typical for a specific genus. Next, we further explore the relationship between signature values and language properties.…”

Section: Logogram Vs Phonogram (supporting)

confidence: 90%

“…Intrinsic probing, on the other hand, explores the internal structure of linguistic information within representations (Torroba Hennigen et al, 2020). Stanczak et al (2022) conducted a large-scale empirical study over two multilingual pre-trained models, mBERT, and XLM-R, and investigated whether morphosyntactic information is encoded in the same subset of neurons in different languages. Their findings reveal that there is considerable cross-lingual overlap between neurons, but the magnitude varies among categories and is dependent on language proximity and pre-training data size.…”

Section: Related Work (mentioning)

confidence: 99%

“…We provide a list of morphosyntactic categories we use in Appendix A. We follow Stanczak et al (2022) and use the converter to switch morphosyntactic annotations from UD v2.1 to UniMorph schema.…”

Section: B Additional Details For Experiments, B.1 Details For Data (mentioning)

confidence: 99%

A Joint Matrix Factorization Analysis of Multilingual Representations

Zhao, Ziser, Webber, et al. 2023

Findings of the Association for Computational Linguistics: EMNLP 2023

We present an analysis tool based on joint matrix factorization for comparing latent representations of multilingual and monolingual models. An alternative to probing, this tool allows us to analyze multiple sets of representations in a joint manner. Using this tool, we study to what extent and how morphosyntactic features are reflected in the representations learned by multilingual pre-trained models. We conduct a large-scale empirical study of over 33 languages and 17 morphosyntactic categories. Our findings demonstrate variations in the encoding of morphosyntactic information across upper and lower layers, with category-specific differences influenced by language properties. Hierarchical clustering of the factorization outputs yields a tree structure that is related to phylogenetic trees manually crafted by linguists. Moreover, we find the factorization outputs exhibit strong associations with performance observed across different cross-lingual tasks. We release our code to facilitate future research.

1 Experiment-Control Modeling for Multilingual Analysis

We employ factor analysis to generate a distinctive signature for a group of representations within an …

“…Particularly, they proposed to rank frozen encoder representations by determining the percentage of trees that are recoverable from them, and based on that ranking choose which LLM to plug. Focused on morphology, Stanczak et al (2022) showed that subsets of neurons model morphosyntax across a variety of languages in multilingual LLMs.…”

Section: Related Work (mentioning)

confidence: 99%

Assessment of Pre-Trained Models Across Languages and Grammars

Muñoz-Ortiz, Vilares, Gómez-Rodríguez 2023

Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacifi

We present an approach for assessing how multilingual large language models (LLMs) learn syntax in terms of multi-formalism syntactic structures. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select a few LLMs and study them on 13 diverse UD treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show that: (i) the framework is consistent across encodings, (ii) pre-trained word vectors do not favor constituency representations of syntax over dependencies, (iii) sub-word tokenization is needed to represent syntax, in contrast to character-based models, and (iv) occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors.

Data-driven Cross-lingual Syntax: An Agreement Study with Massively Multilingual Models

Varda, Marelli 2023

Computational Linguistics

Massively multilingual models such as mBERT and XLM-R are increasingly valued in Natural Language Processing research and applications, due to their ability to tackle the uneven distribution of resources available for different languages. The models’ ability to process multiple languages relying on a shared set of parameters raises the question of whether the grammatical knowledge they extracted during pre-training can be considered as a data-driven cross-lingual grammar. The present work studies the inner workings of mBERT and XLM-R in order to test the cross-lingual consistency of the individual neural units that respond to a precise syntactic phenomenon, that is, number agreement, in five languages (English, German, French, Hebrew, Russian). We found that there is a significant overlap in the latent dimensions that encode agreement across the languages we considered. This overlap is larger (a) for long- vis-à-vis short-distance agreement and (b) when considering XLM-R as compared to mBERT, and peaks in the intermediate layers of the network. We further show that a small set of syntax-sensitive neurons can capture agreement violations across languages; however, their contribution is not decisive in agreement processing.
