Abstract: We challenge a common assumption in active learning: that a list-based interface populated by informative samples provides for efficient and effective data annotation. We show how a 2D scatterplot populated with diverse and representative samples can yield improved models given the same time budget. We consider this for bootstrapping-based information extraction, in particular named entity classification, where human and machine jointly label data. To enable effective data annotation in a scatterplot, we have …
“…In (Lison et al., 2020), the weak training data is created by broadly collecting available labeling rules from multiple sources, which demonstrates the importance of being able to automatically find new heuristics missed by human efforts. To find new heuristic rules on the basis of a relatively limited number of manually designed rules, previous studies have tried bootstrapping, relying on co-occurrence, context, and pattern features (Thelen and Riloff, 2002; Riloff et al., 2003; Yangarber, 2003; Shen et al., 2017; Tao et al., 2015; Berger et al., 2018; Yan et al., 2019).…”
Instead of using expensive manual annotations, researchers have proposed to train named entity recognition (NER) systems using heuristic labeling rules. However, devising labeling rules is challenging because it often requires a considerable amount of manual effort and domain expertise. To alleviate this problem, we propose GLARA, a graph-based labeling rule augmentation framework, to learn new labeling rules from unlabeled data. We first create a graph with nodes representing candidate rules extracted from unlabeled data. Then, we design a new graph neural network to augment labeling rules by exploring the semantic relations between rules. We finally apply the augmented rules on unlabeled data to generate weak labels and train a NER model using the weakly labeled data. We evaluate our method on three NER datasets and find that we can achieve an average improvement of +20% F1 score over the best baseline when given a small set of seed rules.
Figure: GLARA pipeline. Seeding rules (e.g. *noma → Disease, *athy → Disease, *homa → Disease, *kemias → Disease, *ndrome → Disease) drive the ranking and selection of new rules from candidate suffix rules, which the labeling rule applier then uses to assign weak labels (e.g. *tion → Other, *lity → Other).
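The suffix-rule labeling step described above can be sketched in a few lines. The rules mirror the figure's examples, but the matching logic, function names, and test tokens are illustrative assumptions, not GLARA's actual implementation:

```python
# Hypothetical sketch of suffix-rule weak labeling in the spirit of GLARA.
# Rules echo the paper's figure; everything else is invented for illustration.

SEED_RULES = {
    "noma": "Disease",    # e.g. "carcinoma"
    "athy": "Disease",    # e.g. "neuropathy"
    "kemias": "Disease",  # e.g. "leukemias"
    "tion": "Other",
    "lity": "Other",
}

def apply_rules(token, rules):
    """Return the label of the longest matching suffix rule, or None."""
    best = None
    for suffix, label in rules.items():
        if token.lower().endswith(suffix):
            if best is None or len(suffix) > len(best[0]):
                best = (suffix, label)
    return best[1] if best else None

def weak_label(tokens, rules):
    """Weakly label a token sequence; unmatched tokens get 'O'."""
    return [apply_rules(t, rules) or "O" for t in tokens]

print(weak_label(["carcinoma", "mutation", "neuropathy"], SEED_RULES))
# -> ['Disease', 'Other', 'Disease']
```

In the full framework these weak labels would then train the downstream NER model; here they simply illustrate why cheap suffix rules can bootstrap supervision without manual annotation.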
“…However, those heuristic constraints are usually not flexible due to their requirement for expert efforts. In contrast, recent studies focus on learning distance metrics to determine boundaries using weak supervision (Berger et al., 2018; Zupon et al., 2019; Yan et al., 2020a). For example, Yan et al. (2020a) propose an end-to-end bootstrapping network learned by multi-view learning, and extend it with self-supervised and supervised pre-training (Yan et al., 2020b).…”
Section: Related Work
confidence: 99%
“…Unfortunately, these heuristic metrics heavily depend on the selected seeds, making the boundary biased and unreliable (Curran et al., 2007; McIntosh and Curran, 2009). Although some studies extend them with extra constraints (Carlson et al., 2010) or manual participation (Berger et al., 2018), the requirement of expert knowledge makes them ad hoc and inflexible. Some studies try to learn the distance metrics (Zupon et al., 2019; Yan et al., 2020a), but they still suffer from weak supervision.…”
Bootstrapping has become the mainstream method for entity set expansion. Conventional bootstrapping methods mostly define the expansion boundary using seed-based distance metrics, which heavily depend on the quality of the selected seeds and are hard to adjust due to the extremely sparse supervision. In this paper, we propose Bootstrap-GAN, a new learning method for bootstrapping that jointly models the bootstrapping process and the boundary learning process in a GAN framework. Specifically, the expansion boundaries of different bootstrapping iterations are learned via different discriminator networks; the bootstrapping network is the generator that produces new positive entities, and the discriminator networks identify the expansion boundaries by trying to distinguish the generated entities from known positive entities. By iteratively performing this adversarial learning, the generator and the discriminators reinforce each other and are progressively refined over the whole bootstrapping process. Experiments show that Bootstrap-GAN achieves new state-of-the-art entity set expansion performance.
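The adversarial loop described in this abstract can be illustrated with a deliberately tiny stand-in: 1-D "embeddings" instead of neural representations, a centroid-based generator, and a midpoint boundary instead of a trained discriminator network. All values below are invented for illustration; the real Bootstrap-GAN trains neural generator and per-iteration discriminator networks:

```python
# Toy sketch of adversarial bootstrapping in the spirit of Bootstrap-GAN.
# Positive entities cluster near 1.0, negatives near 0.0 (invented values).

seeds = [0.95, 0.90, 0.92]                          # known positive entities
pool = [0.88, 0.85, 0.15, 0.80, 0.10, 0.93, 0.20]   # unlabeled candidates

def generate(known, pool, k=2):
    """Generator: propose the k candidates nearest the positive centroid."""
    mu = sum(known) / len(known)
    return sorted(pool, key=lambda x: abs(x - mu))[:k]

def boundary(known, pool):
    """Discriminator stand-in: a boundary separating known positives from
    the remaining pool (a midpoint here, a trained network in the paper)."""
    mu_pos = sum(known) / len(known)
    mu_pool = sum(pool) / len(pool)
    return (mu_pos + mu_pool) / 2

known = list(seeds)
for _ in range(3):                 # a fresh boundary per iteration
    if not pool:
        break
    proposed = generate(known, pool)
    b = boundary(known, pool)
    accepted = [x for x in proposed if x > b]   # survive the boundary test
    known.extend(accepted)
    pool = [x for x in pool if x not in accepted]

print(sorted(known))
# -> [0.8, 0.85, 0.88, 0.9, 0.92, 0.93, 0.95]
```

The loop mirrors the paper's structure (generated entities tested against a per-iteration boundary, with accepted ones feeding the next round) while replacing the learned components with toy arithmetic; note that the far-away candidates 0.10, 0.15, and 0.20 are proposed in the last round but rejected by the boundary.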
“…The pipelined methods (Riloff and Jones, 1999; Collins and Singer, 1999) mainly leverage direct co-occurrence information, which easily leads to the semantic drift problem (Curran et al., 2007). To resolve this problem, many pipelined methods have been proposed, e.g., mutually exclusive bootstrapping (Curran et al., 2007; Curran, 2008, 2009; Gupta et al., 2018), bootstrapping using negative seeds (Yangarber et al., 2002; Shi et al., 2014), lexical and statistical features (Liao and Grishman, 2010; Gupta and Manning, 2014), word embeddings (Batista et al., 2015; Gupta and Manning, 2015; Zupon et al., 2019), active learning (Berger et al., 2018), lookahead search (Yan et al., 2019), etc. Recently, Yan et al. (2020) propose an end-to-end bootstrapping model and show its advantages in information leveraging and flexibility.…”
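The classic pipelined loop over pattern/entity co-occurrence, and the semantic drift it invites, can be shown with a toy corpus. The patterns, entities, and acceptance threshold below are invented for illustration:

```python
# Minimal pipelined bootstrapping over pattern/entity co-occurrence,
# in the spirit of Riloff & Jones (1999). Toy corpus of (pattern, entity)
# pairs; the task is to expand a seed set of city names.
from collections import Counter

corpus = [
    ("cities such as", "Paris"),
    ("cities such as", "Berlin"),
    ("capital of France ,", "Paris"),
    ("flew to", "Berlin"),
    ("flew to", "Tokyo"),
    ("cities such as", "Tokyo"),
    ("flew to", "Mars"),
]

seeds = {"Paris"}
patterns, entities = set(), set(seeds)

for _ in range(2):
    # Score patterns by how many known entities they extract,
    # then accept any pattern that matched at least one.
    pat_scores = Counter(p for p, e in corpus if e in entities)
    patterns |= {p for p, c in pat_scores.items() if c >= 1}
    # Extract every entity matched by an accepted pattern.
    entities |= {e for p, e in corpus if p in patterns}

print(sorted(entities))
# -> ['Berlin', 'Mars', 'Paris', 'Tokyo']
```

Note how the low-precision pattern "flew to" drags in "Mars" by the second iteration: exactly the drift that mutual exclusion, negative seeds, and the other remedies surveyed above try to contain.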
Bootstrapping for entity set expansion (ESE), which expands new entities using only a few seed entities as supervision, has been studied for a long time. Recent end-to-end bootstrapping approaches have shown their advantages in information capturing and bootstrapping process modeling. However, due to the sparse supervision problem, previous end-to-end methods often only leverage information from near neighborhoods (local semantics) rather than information propagated through the co-occurrence structure of the whole corpus (global semantics). To address this issue, this paper proposes the Global Bootstrapping Network (GBN) with "pre-training and fine-tuning" strategies for effective learning. Specifically, it contains a global-sighted encoder to capture and encode both local and global semantics into entity embeddings, and an attention-guided decoder to sequentially expand new entities based on these embeddings. The experimental results show that the GBN learned by the "pre-training and fine-tuning" strategies achieves state-of-the-art performance on two bootstrapping datasets.
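The local-vs-global contrast motivating GBN can be illustrated with a toy co-occurrence graph: one propagation step sees only near neighborhoods, while several steps spread seed evidence through the whole graph. The graph, damping factor, and scoring scheme below are invented for illustration and are not GBN's encoder:

```python
# Toy contrast of local vs global semantics over a co-occurrence graph.
# "Paris" is the seed; "Munich" never co-occurs with it directly and is
# only reachable through "Berlin".

graph = {  # entity -> co-occurring entities (invented)
    "Paris": ["Berlin", "Lyon"],
    "Berlin": ["Paris", "Munich"],
    "Lyon": ["Paris"],
    "Munich": ["Berlin"],
}

def propagate(seeds, graph, steps):
    """Spread seed scores over the graph for `steps` hops, adding half of
    each neighbor's degree-normalized score at every step."""
    score = {n: (1.0 if n in seeds else 0.0) for n in graph}
    for _ in range(steps):
        new = dict(score)
        for n, nbrs in graph.items():
            new[n] += 0.5 * sum(score[m] / len(graph[m]) for m in nbrs)
        score = new
    return score

local = propagate({"Paris"}, graph, steps=1)    # near neighborhood only
global_ = propagate({"Paris"}, graph, steps=3)  # whole-graph propagation

print(local["Munich"], global_["Munich"])
# -> 0.0 0.1875
```

After one hop "Munich" gets no evidence at all, while three hops of propagation reach it through "Berlin": a minimal picture of why encoding global co-occurrence structure can recover candidates that local semantics misses.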
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.