2020
DOI: 10.48550/arxiv.2001.06286
Preprint

RobBERT: a Dutch RoBERTa-based Language Model

Abstract: Pre-trained language models have been dominating the field of natural language processing in recent years, and have led to significant performance gains for various complex natural language tasks. One of the most prominent pre-trained language models is BERT (Bidirectional Encoder Representations from Transformers), which was released as an English as well as a multilingual version. Although multilingual BERT performs well on many tasks, recent studies showed that BERT models trained on a single language significantly outperf…

Cited by 25 publications (35 citation statements)
References 11 publications
“…While shared subword vocabularies proved to be a practical compromise that allows handling multiple languages within the same network, they are suboptimal when targeting a specific language; recent work reports gains from customized single-language vocabularies (Delobelle et al., 2020).…”
Section: Multilingual Models
confidence: 99%
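As an illustration of the vocabulary effect described in the statement above, the short Python sketch below tokenizes the same Dutch sentence with a shared multilingual vocabulary and with RobBERT's Dutch-specific vocabulary. The Hugging Face model identifiers ("bert-base-multilingual-cased" and "pdelobelle/robbert-v2-dutch-base") are assumptions about the published checkpoints, not names taken from the citing paper.

from transformers import AutoTokenizer

# Assumed Hugging Face hub identifiers for the multilingual and Dutch checkpoints.
multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
dutch = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")

sentence = "De onderzoekers trainden een Nederlands taalmodel."  # "The researchers trained a Dutch language model."

# A language-specific vocabulary typically splits Dutch words into fewer,
# more meaningful subword units than a shared multilingual vocabulary does.
print("mBERT  :", multilingual.tokenize(sentence))
print("RobBERT:", dutch.tokenize(sentence))

The comparison only shows segmentation granularity; it does not by itself demonstrate downstream gains, which the cited work measures on labeled tasks.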
“…In the second setup, we use BERTje, the Dutch BERT model of de Vries et al. (2019), and RobBERT, the Dutch RoBERTa model of Delobelle et al. (2020), with their corresponding English counterparts, as well as multilingual BERT (mBERT), as sequence classifiers on the Entailment task of SICK(-NL). Here we observe a similar pattern in the results in Table 3: while there are individual differences on the same task, the main surprise is that the Dutch dataset is harder, even when exactly the same model (mBERT) is used.…”
Section: SICK
confidence: 99%
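The statement above describes using the monolingual and multilingual models as sequence classifiers for entailment. Below is a minimal sketch of that kind of setup, assuming the transformers library, the hub identifier "pdelobelle/robbert-v2-dutch-base", and a three-way entailment label set; it is not the citing authors' exact training code.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "pdelobelle/robbert-v2-dutch-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Fresh classification head with three labels (entailment / neutral / contradiction).
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Premise and hypothesis are encoded as one sequence pair, as is standard
# for BERT-style entailment classification.
premise = "Een man speelt gitaar."      # "A man plays guitar."
hypothesis = "Iemand maakt muziek."     # "Someone is making music."
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # meaningless until the head is fine-tuned on SICK(-NL)

In practice the classification head would be fine-tuned on the SICK(-NL) entailment pairs before evaluation; the sketch only shows how the pair encoding and classification head fit together.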
“…Moreover, the syntactically parsed LASSY corpus of written Dutch (van Noord et al., 2013) and the SONAR corpus of written Dutch (Oostdijk et al., 2013) provide rich resources on which NLP systems may be developed. Indeed, Dutch is in the scope of the multilingual BERT models published by Google (Devlin et al., 2019), and two monolingual Dutch BERT models have been published as part of Hugging Face's transformers library (de Vries et al., 2019; Delobelle et al., 2020).…”
Section: Introduction
confidence: 99%
“…Initially, most of the research took place in English, followed by multilingual approaches (Conneau et al., 2019). Although multilingual approaches were trained on large texts in many languages, they were outperformed by single-language models (de Vries et al., 2019; Martin et al., 2020; Le et al., 2020; Delobelle et al., 2020). Single-language models trained on the Open Super-large Crawled ALMAnaCH coRpus (OSCAR) showed good performance due to the size and variance of OSCAR (Martin et al., 2020; Delobelle et al., 2020).…”
Section: Introduction
confidence: 99%