Since BERT models were found to be effective for a wide range of NLP tasks (Devlin et al., 2019), several efforts have been directed towards improving them through more efficient training strategies (Yang et al., 2019b; Sanh et al., 2019; Lan et al., 2019), training them for different domains (Lee et al., 2019a; Lee and Hsiang, 2019; Chalkidis et al., 2020; Gururangan et al., 2020), and training them for different languages (Devlin, 2018; de Vries et al., 2019; Le et al., 2020; Martin et al., 2020; Delobelle et al., 2020; Cañete et al., 2020). Within the clinical domain, such models include the BioBERT models pretrained on PubMed abstracts and PMC full-text articles (Lee et al., 2019a), SciBERT trained on scientific text (Beltagy et al., 2019), the clinicalBERT models trained on patient notes from the MIMIC-III corpus (Johnson et al., 2016), sometimes as a continuation of the BioBERT models (Alsentzer et al., 2019), and the BlueBERT models, which also use PubMed abstracts and MIMIC-III patient notes for training (Peng et al., 2019).