Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems 2020
DOI: 10.18653/v1/2020.eval4nlp-1.15
ClusterDataSplit: Exploring Challenging Clustering-Based Data Splits for Model Performance Evaluation

Abstract: This paper adds to the ongoing discussion in the natural language processing community on how to choose a good development set. Motivated by the real-life necessity of applying machine learning models to different data distributions, we propose a clustering-based data splitting algorithm. It creates development (or test) sets which are lexically different from the training data while ensuring similar label distributions. Hence, we are able to create challenging cross-validation evaluation setups while abstract…
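The abstract only outlines the procedure. A minimal sketch of the general idea (not the authors' implementation) is given below, assuming TF-IDF features, k-means clustering, and a simple greedy label-balancing heuristic as placeholder choices:

```python
# Hedged sketch of a clustering-based data split: documents are clustered on
# lexical TF-IDF features so held-out folds are lexically distant from the
# training data, and whole clusters are assigned to folds greedily to keep
# fold sizes and label distributions roughly balanced.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_based_folds(texts, labels, n_folds=5, n_clusters=50, seed=0):
    """Assign each document to a fold via lexical clustering.

    Returns an array of fold indices, one per document.
    """
    tfidf = TfidfVectorizer(lowercase=True, max_features=20000)
    X = tfidf.fit_transform(texts)
    clusters = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)

    folds = np.zeros(len(texts), dtype=int)
    fold_label_counts = [Counter() for _ in range(n_folds)]
    fold_sizes = [0] * n_folds

    # Place each whole cluster into the fold where it least inflates the
    # fold size and the counts of the labels it carries (greedy balancing).
    for c in range(n_clusters):
        idx = np.where(clusters == c)[0]
        cluster_labels = Counter(labels[i] for i in idx)
        best_fold = min(
            range(n_folds),
            key=lambda f: fold_sizes[f]
            + sum(fold_label_counts[f].get(l, 0) for l in cluster_labels),
        )
        folds[idx] = best_fold
        fold_label_counts[best_fold].update(cluster_labels)
        fold_sizes[best_fold] += len(idx)
    return folds
```

Because entire clusters, rather than individual documents, are assigned to folds, each held-out fold contains vocabulary the training folds have not seen, which is what makes the resulting cross-validation setup more challenging than a random split.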

Cited by 1 publication (3 citation statements) · References 19 publications
“…Further, there is a line of work now questioning traditional train-dev splits [6] as well as random splits [16]. More challenging data splits can be created by clustering the documents based on their similarity, where each split encodes unique information to a certain degree [18]. We use this method to train ensembles of models on these splits in a cross-validation format, such that each model has observed slightly different training instances.…”
Section: Related Work (citation type: mentioning, confidence: 99%)
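The citing work trains one model per cluster-based fold and combines them. A rough sketch of that usage, assuming a hypothetical `train_model` callable, integer-encoded labels, and fold assignments such as those produced above, might look like:

```python
# Hedged sketch of ensembling over cluster-based folds: each model is trained
# with one fold held out, so every model sees slightly different training
# data; test predictions are combined by majority vote.
import numpy as np


def train_cv_ensemble(X, y, folds, train_model):
    """Train one model per held-out fold (y assumed to be a NumPy array)."""
    models = []
    for f in np.unique(folds):
        train_idx = np.where(folds != f)[0]
        models.append(train_model(X[train_idx], y[train_idx]))
    return models


def ensemble_predict(models, X_test):
    """Majority vote over the per-fold models (labels assumed non-negative ints)."""
    preds = np.stack([m.predict(X_test) for m in models])
    return np.array([np.bincount(col).argmax() for col in preds.T.astype(int)])
```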
“…Then, we train the model using only the train-fraction of all the data and use the held-out validation data to determine the best model, which is then used to annotate the test data. As an alternative to random splits, we follow [18] and create strategic data splits by clustering the documents according to their similarity. This creates more challenging splits, as more distant documents are left out for validation.…”
Section: Strategic Datasplits (citation type: mentioning, confidence: 99%)