2022
DOI: 10.1007/978-3-031-05760-1_40

ChouBERT: Pre-training French Language Model for Crowdsensing with Tweets in Phytosanitary Context

Abstract: To fulfil the increasing need for food of the growing population and face climate change, modern technologies have been applied to improve different farming processes. One important application scenario is to detect and measure natural hazards using sensors and data analysis techniques. Crowdsensing is a sensing paradigm that empowers ordinary people to contribute data that their sensor-enhanced mobile devices gather or generate. In this paper, we propose to use Twitter as an open crowdsensing platform for acq…

Cited by 2 publications (11 citation statements)
References 9 publications
“…Therefore, the main challenge of detecting natural hazards from textual contents on social media is to identify unseen risks with low resources for training. We reuse the labeled tweets produced by ChouBERT (Jiang et al, 2022), tweets about corn borer, barley yellow dwarf virus (BYDV) and corvids for training and validation, and tweets about unseen and polysemous terms such as “taupin” (wireworm in English) for testing the generalizability of the classifier. Since the binary cross entropy loss adopted by the discriminator of GAN-BERT favors the majority class when data are unbalanced, for the different training experiments, we sampled ChouBERT's training data to 16, 32, 64, 128, 256, and 512 subsets, each subset having equal number of observations and non-observations.…”
Section: Methods (mentioning)
confidence: 99%
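
The balanced sampling described in the statement above can be sketched as follows. This is only an illustration, not the citing authors' code: it assumes the listed numbers (16 to 512) are subset sizes, that labels are encoded as 1 (observation) and 0 (non-observation), and the names `balanced_subset`, `make_subsets` and `SUBSET_SIZES` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Assumed data layout: (tweet text, label), with 1 = observation, 0 = non-observation.
Example = Tuple[str, int]

# Subset sizes quoted in the citation statement (interpreted here as sizes).
SUBSET_SIZES = (16, 32, 64, 128, 256, 512)


def balanced_subset(data: Sequence[Example], size: int, seed: int = 0) -> List[Example]:
    """Draw `size` examples with exactly size/2 from each class."""
    if size % 2 != 0:
        raise ValueError("size must be even to split equally between two classes")
    per_class = size // 2
    positives = [ex for ex in data if ex[1] == 1]
    negatives = [ex for ex in data if ex[1] == 0]
    if len(positives) < per_class or len(negatives) < per_class:
        raise ValueError("not enough examples in one class for the requested size")
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    subset = rng.sample(positives, per_class) + rng.sample(negatives, per_class)
    rng.shuffle(subset)  # avoid all positives preceding all negatives
    return subset


def make_subsets(
    data: Sequence[Example],
    sizes: Sequence[int] = SUBSET_SIZES,
    seed: int = 0,
) -> Dict[int, List[Example]]:
    """Build one balanced subset per requested size."""
    return {n: balanced_subset(data, n, seed) for n in sizes}
```

Sampling each class separately, rather than sampling from the pooled data, is what guarantees the 50/50 split regardless of how unbalanced the full training set is.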
“…Following the study of Jiang et al (2022), the ChouBERT models are further-pre-trained CamemBERT-base models over French Plant Health Bulletins and Tweets and the ChouBERT pre-trained for 16 epochs (denoted as ChouBERT-16) and for 32 epochs (denoted as ChouBERT-32) are the most efficient in finding observations about plant health issues. Thus, in this study, we combine GAN-BERT settings with CamemBERT, ChouBERT-16, and ChouBERT-32.…”
Section: Methods (mentioning)
confidence: 99%