Dan Garrette scite author profile

In this paper, we show that Multilingual BERT (M-BERT), released by Devlin et al. (2019) as a single language model pre-trained from monolingual corpora in 104 languages, is surprisingly good at zero-shot cross-lingual model transfer, in which task-specific annotations in one language are used to fine-tune the model for evaluation in another language. To understand why, we present a large number of probing experiments, showing that transfer is possible even to languages in different scripts, that transfer works best between typologically similar languages, that monolingual corpora can train models for code-switching, and that the model can find translation pairs. From these results, we can conclude that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs.


TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages

Clark

Choi

Collins

et al. 2020

Transactions of the Association for Computational Linguistics

231

244


Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA—a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology—the set of linguistic features each language expresses—such that we expect models performing well on this set to generalize across a large number of the world’s languages. We present a quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer, but don’t know the answer yet, and the data is collected directly in each language without the use of translation.


How multilingual is Multilingual BERT?

Pires

Schlinger

Garrette

2019

Preprint


BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Scao

Fan

Akiki

et al. 2022

Preprint


Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Clark

Garrette

Turc

et al. 2022


Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.


XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation

Ruder

Constant

Botha

et al. 2021


Machine learning has brought striking advances in multilingual natural language processing capabilities over the past year. For example, the latest techniques have improved the state-of-the-art performance on the XTREME multilingual benchmark by more than 13 points. While a sizeable gap to human-level performance remains, improvements have been easier to achieve in some tasks than in others. This paper analyzes the current state of cross-lingual transfer learning and summarizes some lessons learned. In order to catalyze meaningful progress, we extend XTREME to XTREME-R, which consists of an improved set of ten natural language understanding tasks, including challenging language-agnostic retrieval tasks, and covers 50 typologically diverse languages. In addition, we provide a massively multilingual diagnostic suite (MULTICHECKLIST) and fine-grained multi-dataset evaluation capabilities through an interactive public leaderboard to gain a better understanding of such models.


A Formal Approach to Linking Logical Form and Vector-Space Lexical Semantics

Garrette

Erk

Mooney

2014


First-order logic provides a powerful and flexible mechanism for representing natural language semantics. However, how best to integrate it with uncertain, weighted knowledge, for example regarding word meaning, remains an open question. This paper describes a mapping between predicates of logical form and points in a vector space. This mapping is then used to project distributional inferences to inference rules in logical form. We then describe first steps of an approach that uses this mapping to recast first-order semantics into the probabilistic models that are part of Statistical Relational AI. Specifically, we show how Discourse Representation Structures can be combined with distributional models for word meaning inside a Markov Logic Network and used to successfully perform inferences that take advantage of logical concepts such as negation and factivity as well as weighted information on word meaning in context.


Part-of-Speech Tagging for Code-Switched, Transliterated Texts without Explicit Language Identification

Ball

Garrette

2018


Code-switching, the use of more than one language within a single utterance, is ubiquitous in much of the world, but remains a challenge for NLP largely due to the lack of representative data for training models. In this paper, we present a novel model architecture that is trained exclusively on monolingual resources, but can be applied to unseen code-switched text at inference time. The model accomplishes this by jointly maintaining separate word representations for each of the possible languages (or scripts, in the case of transliteration), allowing each to contribute to inferences without forcing the model to commit to a language. Experiments on Hindi-English part-of-speech tagging demonstrate that our approach outperforms standard models when training on monolingual text without transliteration, and testing on code-switched text with alternate scripts.

