2021
DOI: 10.48550/arxiv.2109.10234
Preprint

BERTweetFR : Domain Adaptation of Pre-Trained Language Models for French Tweets

Abstract: We introduce BERTweetFR, the first large-scale pre-trained language model for French tweets. Our model is initialized using the general-domain French language model CamemBERT (Martin et al., 2020), which follows the base architecture of RoBERTa. Experiments show that BERTweetFR outperforms all previous general-domain French language models on two downstream Twitter NLP tasks of offensiveness identification and named entity recognition. The dataset used in the offensiveness detection task is first created and ann…
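To make the described setup concrete, below is a minimal sketch of how a CamemBERT-style checkpoint could be fine-tuned on a downstream offensiveness-classification task with the Hugging Face Transformers Trainer API. It is illustrative only: the checkpoint ID (camembert-base stands in for the BERTweetFR weights), the toy tweets, and the training arguments are assumptions, not the authors' actual configuration.

```python
# Hedged sketch: fine-tuning a CamemBERT-style checkpoint on binary
# offensiveness classification with Hugging Face Transformers.
# MODEL_ID is a placeholder, not the official BERTweetFR checkpoint.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_ID = "camembert-base"  # assumption: stand-in for the BERTweetFR weights

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Toy labelled tweets (hypothetical data, for illustration only).
texts = ["exemple de tweet inoffensif", "exemple de tweet offensant"]
labels = [0, 1]
enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class TweetDataset(torch.utils.data.Dataset):
    """Wraps tokenized tweets and labels for the Trainer API."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=TweetDataset(enc, labels),
)
trainer.train()
```

The same pattern would apply to the paper's named-entity-recognition task by swapping in a token-classification head.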

Cited by 4 publications (4 citation statements) | References 15 publications
“…Continual learning aims to specialize a PLM to a particular domain by continuing the pretraining task on the abundant unlabeled corpora, such as biological documents (BioBERT [90]), scientific papers (SciBERT [91]), clinical notes (ClinicalBERT [92], [93], ClinicalCLNet [94]), financial news (FinBERT [11]), legal documents (LegalBERT [12]) and tweets (BERTweet [95] for English tweets and BERTweetFR [96] for French tweets). During this stage, we continue to train the PLM using the same pretraining task, which is usually a language modeling objective, on the target domain datasets.…”
Section: A. Continual Learning
confidence: 99%
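The continual-learning recipe quoted above (keep the same pretraining objective, swap in unlabeled in-domain text) can be sketched as follows. This is a hedged illustration: the camembert-base starting checkpoint, the tweets.txt corpus file, and all hyperparameters are placeholder assumptions rather than the setup actually used for BERTweetFR.

```python
# Hedged sketch of domain-adaptive (continual) pretraining: resume the
# masked-language-modeling objective of a general-domain checkpoint on
# unlabeled in-domain text (here, raw tweets). Model ID and file path are
# illustrative assumptions, not the authors' actual configuration.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

MODEL_ID = "camembert-base"          # general-domain starting point (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# tweets.txt: one unlabeled in-domain tweet per line (hypothetical corpus file).
raw = load_dataset("text", data_files={"train": "tweets.txt"})
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Dynamic masking, as in RoBERTa-style pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-adapted", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()   # same pretraining task, new in-domain data
```

In practice, this stage runs over a much larger unlabeled tweet corpus and for many more steps before any downstream fine-tuning.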
“…Given the difference in writing formality on Twitter, we use Twitter-based in addition to general-purpose models. For the former, XLM-T [2] is our multilingual model, BERTweet [20] for English, RoBERTuito [23] for Spanish, AraBERT [1] for Arabic, and BERTweetFR [14] for French. For the latter, we use multilingual BERT [9] as our multilingual model, BERT for English and Arabic, and BETO [6] for Spanish.…”
Section: Implementation Details
confidence: 99%
“…We also introduce BERTweetFR [Guo+21], the first large-scale pre-trained language model for French tweets. Tweets are a valuable source of social media data, but they are often written in an informal tone and have their own set of characteristics compared to conventional sources.…”
Section: Large Scale Linguistic Resources
confidence: 99%