ChouBERT: Pre-training French Language Model for Crowdsensing with Tweets in Phytosanitary Context

Jiang, Shufan; Angarita, Rafael; Cormier, Stéphane; Orensanz, Julien; Rousseaux, Francis

doi:10.1007/978-3-031-05760-1_40

Cited by 2 publications

(11 citation statements)

References 9 publications

“…Therefore, the main challenge of detecting natural hazards from textual contents on social media is to identify unseen risks with low resources for training. We reuse the labeled tweets produced by ChouBERT (Jiang et al., 2022), tweets about corn borer, barley yellow dwarf virus (BYDV) and corvids for training and validation, and tweets about unseen and polysemous terms such as “taupin” (wireworm in English) for testing the generalizability of the classifier. Since the binary cross entropy loss adopted by the discriminator of GAN-BERT favors the majority class when data are unbalanced, for the different training experiments, we sampled ChouBERT's training data to 16, 32, 64, 128, 256, and 512 subsets, each subset having equal number of observations and non-observations.…”

Section: Methods (mentioning)

confidence: 99%
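The balanced sampling described in this statement can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the numbers 16 through 512 are subset sizes, and the function name `balanced_subset`, the `(text, label)` tuple layout, and the toy data are all hypothetical.

```python
import random

def balanced_subset(examples, size, seed=0):
    """Draw `size` examples, half labeled 1 (observation) and half labeled 0.

    `examples` is a list of (text, label) pairs; this layout is an assumption.
    """
    if size % 2:
        raise ValueError("size must be even to split the classes equally")
    half = size // 2
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    if len(positives) < half or len(negatives) < half:
        raise ValueError("not enough examples in one of the classes")
    rng = random.Random(seed)
    subset = rng.sample(positives, half) + rng.sample(negatives, half)
    rng.shuffle(subset)  # avoid having all positives first
    return subset

# Toy, unbalanced data standing in for the labeled tweets (one third positive).
toy_data = [(f"tweet {i}", 1 if i % 3 == 0 else 0) for i in range(2000)]

SUBSET_SIZES = (16, 32, 64, 128, 256, 512)
subsets = {n: balanced_subset(toy_data, n) for n in SUBSET_SIZES}
```

Sampling each class separately guarantees the 50/50 split regardless of how skewed the full dataset is, which is the property the statement asks for because of the binary cross-entropy loss.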

“…Following the study of Jiang et al. (2022), the ChouBERT models are further-pre-trained CamemBERT-base models over French Plant Health Bulletins and Tweets and the ChouBERT pre-trained for 16 epochs (denoted as ChouBERT-16) and for 32 epochs (denoted as ChouBERT-32) are the most efficient in finding observations about plant health issues. Thus, in this study, we combine GAN-BERT settings with CamemBERT, ChouBERT-16, and ChouBERT-32.…”

Section: Methods (mentioning)

confidence: 99%

“…When applying the PLM-only classification to other datasets in other domains, we might need to find an optimal threshold depending on the real needs for precision or recall. Based on the results presented in the study of Jiang et al. (2022), we fixed the learning rate to 2e-5, the maximum sequence length to 128, and fit the classifier for 10 epochs. We set the batch size to (training_data_size / 8) to have the same steps for the different training data sizes.…”

Section: Methods (mentioning)

confidence: 99%
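The batch-size rule in this statement can be checked with a little arithmetic. The sketch below assumes the training data sizes are the subset sizes 16 through 512 quoted in the first citation statement; the constant and function names are hypothetical and only restate the quoted hyperparameters.

```python
import math

# Hyperparameters quoted in the citation statement.
LEARNING_RATE = 2e-5
MAX_SEQ_LENGTH = 128
EPOCHS = 10

# Assumed training data sizes (taken from the first citation statement).
SIZES = (16, 32, 64, 128, 256, 512)

def batch_size_for(training_data_size):
    """Batch size = training_data_size / 8, as stated in the quote."""
    return max(1, training_data_size // 8)

def steps_per_epoch(training_data_size):
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(training_data_size / batch_size_for(training_data_size))

def total_steps(training_data_size):
    """Steps over the whole fit of EPOCHS epochs."""
    return steps_per_epoch(training_data_size) * EPOCHS

for n in SIZES:
    print(n, batch_size_for(n), steps_per_epoch(n), total_steps(n))
```

Because the batch size scales with the data size, every configuration takes the same number of steps per epoch, so differences between runs reflect the amount of data rather than the number of parameter updates.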

“…Still, the extraction of useful plant health information from social media poses some challenges, including lack of context, irrelevancy, homographs, homophones, homonyms, slangs, and colloquialisms. In an earlier study, we developed ChouBERT (Jiang et al., 2022) to detect farmers' observations from tweets for pest monitoring. ChouBERT takes a pre-trained CamemBERT (Martin et al., 2020) model and further pre-trains it on a plant health domain corpus in French to improve the generalizability of plant health hazards detection on Twitter.…”

Section: Introduction (mentioning)

confidence: 99%

“…Among the French varieties of BERT, CamemBERT (Martin et al., 2020) is a model based on the same architecture as BERT but trained on a French corpus with MLM only. ChouBERT (Jiang et al., 2022) takes a pre-trained CamemBERT-base checkpoint and further pre-trains it with MLM over a corpus in French in the plant health domain to improve performance in detecting plant health issues from short texts, particularly, from Twitter.…”

Section: Introduction (mentioning)

confidence: 99%
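For readers unfamiliar with MLM (masked language modeling), the objective mentioned in this statement can be illustrated with a toy masking routine. This is only a sketch of the generic BERT-style corruption step (select about 15% of tokens; replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged). The exact masking settings used for ChouBERT are not given here, so the probabilities, the `<mask>` string, the function name, and the toy vocabulary are assumptions.

```python
import random

MASK = "<mask>"  # assumed mask token string

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_inputs, labels) for one MLM training example.

    labels[i] holds the original token where a prediction is required,
    and None where the position is ignored by the loss.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)               # 80%: mask token
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)  # not selected: no loss at this position
            inputs.append(tok)
    return inputs, labels

# Toy French sentence, split on whitespace for illustration only.
tokens = "des pucerons observés sur les orges ce matin dans la parcelle".split()
vocab = sorted(set(tokens))
inputs, labels = mask_tokens(tokens, vocab)
```

Further pre-training, as described in the quote, repeats this objective on domain text starting from an existing checkpoint instead of from random weights.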

Improving text mining in plant health domain with GAN and/or pre-trained language model

Jiang

Cormier

Angarita

et al. 2023

Front. Artif. Intell.

Self Cite

The Bidirectional Encoder Representations from Transformers (BERT) architecture offers a cutting-edge approach to Natural Language Processing. It involves two steps: 1) pre-training a language model to extract contextualized features and 2) fine-tuning for specific downstream tasks. Although pre-trained language models (PLMs) have been successful in various text-mining applications, challenges remain, particularly in areas with limited labeled data such as plant health hazard detection from individuals' observations. To address this challenge, we propose to combine GAN-BERT, a model that extends the fine-tuning process with unlabeled data through a Generative Adversarial Network (GAN), with ChouBERT, a domain-specific PLM. Our results show that GAN-BERT outperforms traditional fine-tuning in multiple text classification tasks. In this paper, we examine the impact of further pre-training on the GAN-BERT model. We experiment with different hyperparameters to determine the best combination of models and fine-tuning parameters. Our findings suggest that the combination of GAN and ChouBERT can enhance the generalizability of the text classifier but may also lead to increased instability during training. Finally, we provide recommendations to mitigate these instabilities.

Named Entity Recognition for Monitoring Plant Health Threats in Tweets: a ChouBERT Approach

Jiang

Angarita

Cormier

et al. 2022

2022 6th International Conference on Universal Village (UV)

Self Cite
