Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)
DOI: 10.18653/v1/n19-1231

Practical, Efficient, and Customizable Active Learning for Named Entity Recognition in the Digital Humanities

Abstract: Scholars in inter-disciplinary fields like the Digital Humanities are increasingly interested in semantic annotation of specialized corpora. Yet, under-resourced languages, imperfect or noisily structured data, and user-specific classification tasks make it difficult to meet their needs using off-the-shelf models. Manual annotation of large corpora from scratch, meanwhile, can be prohibitively expensive. Thus, we propose an active learning solution for named entity recognition, attempting to maximize a custom …
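The page truncates the abstract, but the core idea, pool-based active learning for a user-defined NER task, can be illustrated with a generic sketch. Everything below (the toy tagger, the mean-token-confidence score, the seed and batch sizes) is an assumption made for illustration, not the paper's implementation.

```python
# Generic pool-based active learning loop for sentence-level NER annotation.
# The tagger is a deliberately trivial stand-in so the sketch runs on its own;
# a real setup would train a CRF or neural tagger at each round.
import random

def train_toy_tagger(labelled):
    """Memorise the most frequent tag observed for each token in the labelled data."""
    counts = {}
    for tokens, tags in labelled:
        for tok, tag in zip(tokens, tags):
            counts.setdefault(tok, {}).setdefault(tag, 0)
            counts[tok][tag] += 1

    def tag(tokens):
        preds, confs = [], []
        for tok in tokens:
            seen = counts.get(tok, {"O": 1})
            best = max(seen, key=seen.get)
            preds.append(best)
            confs.append(seen[best] / sum(seen.values()))
        return preds, sum(confs) / len(confs)  # sentence score = mean token confidence

    return tag

def active_learning(pool, seed_size=10, batch_size=10, rounds=5):
    """Repeatedly move the least-confident sentences from the pool to the labelled set.

    pool: list of (tokens, gold_tags) pairs; the gold tags stand in for the annotator.
    """
    random.shuffle(pool)
    labelled, pool = pool[:seed_size], pool[seed_size:]
    for _ in range(rounds):
        if not pool:
            break
        tagger = train_toy_tagger(labelled)
        pool.sort(key=lambda ex: tagger(ex[0])[1])  # least confident sentences first
        labelled, pool = labelled + pool[:batch_size], pool[batch_size:]
    return labelled
```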

Cited by 20 publications (17 citation statements); References 30 publications (36 reference statements).

Citation statements, ordered by relevance:
“…Although the proposed solutions are shown to outperform other heuristic algorithms with comparably weak models (basic CRF or BERT without fine-tuning) in experiments with a small number of AL iterations, they can be not very practical due to the high computational costs of collecting training data for policy models. Other notable works on deep active learning include (Erdmann et al, 2019), which proposes an AL algorithm based on a bootstrapping approach (Jones et al, 1999) and (Lowell et al, 2019), which concerns the problem of the mismatch between a model used to construct a training dataset via AL (acquisition model) and a final model that is trained on it (successor model).…”
Section: Related Work
confidence: 99%
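The acquisition-versus-successor mismatch raised by Lowell et al. (2019) can be made concrete by reusing the toy helpers from the sketch above: the data is selected by one model inside the active learning loop, but nothing guarantees the resulting set beats a random one for a different model trained on it afterwards. The evaluation metric and comparison below are purely illustrative assumptions.

```python
# Reuses train_toy_tagger() and active_learning() from the sketch above.
def evaluate(tagger, data):
    """Token-level accuracy on (tokens, gold_tags) pairs (illustrative metric only)."""
    correct = total = 0
    for tokens, gold in data:
        preds, _ = tagger(tokens)
        correct += sum(p == g for p, g in zip(preds, gold))
        total += len(gold)
    return correct / max(total, 1)

def successor_comparison(pool, heldout, train_successor=train_toy_tagger):
    """Train the *successor* model on AL-selected vs. randomly selected data of equal size.

    In a real study, train_successor would be a different model family (e.g. a neural
    tagger) than the acquisition model used inside active_learning().
    """
    selected = active_learning(list(pool))
    random_subset = list(pool)[:len(selected)]
    return (evaluate(train_successor(selected), heldout),
            evaluate(train_successor(random_subset), heldout))
```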
“…We do not evaluate on a holdout set. Instead, we follow Erdmann et al (2019) and simulate annotating the complete corpus and evaluate on the very same data as we are interested in how an annotated sub-set helps to annotate the rest of the data, not how well the model generalizes. We assume that users annotate mention spans perfectly, i.e.…”
Section: Simulation
confidence: 99%
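A minimal version of this simulation, reusing the toy helpers from the sketches above, evaluates the model at each round on the not-yet-annotated remainder of the same corpus rather than on a held-out set; the seed and batch sizes are arbitrary assumptions.

```python
# Reuses train_toy_tagger() and evaluate() from the sketches above.
import random

def simulate_annotation(corpus, seed_size=10, batch_size=10):
    """Simulate annotating a whole corpus with AL and track quality on the remainder."""
    random.shuffle(corpus)
    labelled, rest = corpus[:seed_size], corpus[seed_size:]
    curve = []
    while rest:
        tagger = train_toy_tagger(labelled)
        curve.append((len(labelled), evaluate(tagger, rest)))  # quality on the unannotated rest
        rest.sort(key=lambda ex: tagger(ex[0])[1])             # least confident annotated next
        labelled, rest = labelled + rest[:batch_size], rest[batch_size:]
    return curve  # list of (sentences annotated so far, accuracy on the remaining corpus)
```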
“…Another option is active learning, where a ML system asks an oracle (or a user) to select the most relevant examples to consider, thereby lowering the number of data points required to learn a model. This is the approach adopted by Erdmann et al [60] to recognise entities in various Latin classical texts, based on an active learning pipeline able to predict how many and which sentences need to be annotated to achieve a certain degree of accuracy, and later on released as toolkit to build custom NER models for the humanities [61]. Finally, another strategy is data augmentation, where an existing data set is expanded via the transformation of training instances without changing their label.…”
Section: Dealing With the Lack Of Resources
confidence: 99%
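For the data-augmentation strategy mentioned last, a common label-preserving transformation for NER is mention replacement: an entity mention is swapped for another mention of the same type drawn from the training data, and the BIO tags are adjusted to the new length. The sketch below is a generic illustration of that idea, not a method taken from the cited works.

```python
# Label-preserving augmentation for BIO-tagged NER data via mention replacement.
import random
from collections import defaultdict

def mention_replacement(dataset):
    """dataset: list of (tokens, bio_tags) pairs. Returns one augmented copy per sentence."""
    # 1) Collect all mentions per entity type from the B-/I- spans.
    mentions = defaultdict(list)
    for tokens, tags in dataset:
        span, span_type = [], None
        for tok, tag in list(zip(tokens, tags)) + [("", "O")]:  # sentinel closes a trailing span
            if tag.startswith("I-") and span:
                span.append(tok)
                continue
            if span:
                mentions[span_type].append(span)
                span, span_type = [], None
            if tag.startswith("B-"):
                span, span_type = [tok], tag[2:]
    # 2) Rewrite each sentence, swapping every mention for a random one of the same type.
    augmented = []
    for tokens, tags in dataset:
        new_tokens, new_tags, i = [], [], 0
        while i < len(tokens):
            if tags[i].startswith("B-"):
                etype = tags[i][2:]
                j = i + 1
                while j < len(tags) and tags[j] == "I-" + etype:
                    j += 1
                replacement = random.choice(mentions[etype])
                new_tokens += replacement
                new_tags += ["B-" + etype] + ["I-" + etype] * (len(replacement) - 1)
                i = j
            else:
                new_tokens.append(tokens[i])
                new_tags.append(tags[i])
                i += 1
        augmented.append((new_tokens, new_tags))
    return augmented
```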