Abstract: Distant supervision, a paradigm of relation extraction in which training data is created by aligning facts in a database with a large unannotated corpus, is an attractive approach for training relation extractors. Various models have been proposed in recent literature to align the facts in the database to their mentions in the corpus. In this paper, we discuss and critically analyse a popular alignment strategy called the "at least one" heuristic. We provide a simple, yet effective relaxation to this strategy. We formu…
“…While supervised entity recognition systems [14,34] focus on a few common entity types, weakly-supervised methods [18,36] and distantly-supervised methods [41,54,26] use a large text corpus and a small set of seeds (or a knowledge base) to induce patterns or to train models, and thus can be applied to different domains without additional human annotation labor. For relation extraction, weak supervision [6,13] and distant supervision [35,53,49,21,43,31] approaches have similarly been proposed to address the domain-restriction issue in traditional supervised systems [2,33,17]. However, such a "pipeline" paradigm ignores the dependencies between the subtasks and may suffer from error propagation between them.…”
Extracting entities and relations for types of interest from text is important for understanding massive text corpora. Traditionally, entity and relation extraction systems have relied on human-annotated corpora for training and adopted an incremental pipeline. Such systems require additional human expertise to be ported to a new domain, and are vulnerable to errors cascading down the pipeline. In this paper, we investigate joint extraction of typed entities and relations with labeled data heuristically obtained from knowledge bases (i.e., distant supervision). As our algorithm for type labeling via distant supervision is context-agnostic, noisy training data poses unique challenges for the task. We propose a novel domain-independent framework, called COTYPE, that runs a data-driven text segmentation algorithm to extract entity mentions, and jointly embeds entity mentions, relation mentions, text features and type labels into two low-dimensional spaces (for entity and relation mentions respectively), where, in each space, objects whose types are close will also have similar representations. COTYPE then uses these learned embeddings to estimate the types of test (unlinkable) mentions. We formulate a joint optimization problem to learn embeddings from text corpora and knowledge bases, adopting a novel partial-label loss function for noisy labeled data and introducing an object "translation" function to capture the cross-constraints of entities and relations on each other. Experiments on three public datasets demonstrate the effectiveness of COTYPE across different domains (e.g., news, biomedical), with an average of 25% improvement in F1 score compared to the next best method.
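The partial-label loss mentioned in the abstract can be illustrated with a small sketch. This is a generic hinge-style formulation under our own assumptions (a score vector over all types, a margin of 1, and the intuition that the best-scoring candidate label should beat the best non-candidate), not COTYPE's exact objective; the function name and toy values are hypothetical.

```python
import numpy as np

def partial_label_loss(scores, candidate_labels):
    """Hinge-style partial-label loss (illustrative sketch only).

    scores: array of model scores for every type.
    candidate_labels: indices of the noisy candidate types obtained via
    distant supervision. The loss encourages the best candidate type to
    outscore the best non-candidate type by a margin of 1.
    """
    mask = np.zeros(len(scores), dtype=bool)
    mask[candidate_labels] = True
    best_candidate = scores[mask].max()
    best_other = scores[~mask].max()
    return max(0.0, 1.0 - (best_candidate - best_other))

scores = np.array([2.0, 0.5, -1.0, 0.2])
loss_ok = partial_label_loss(scores, [0, 2])   # candidate type 0 wins by > 1, so loss is 0
loss_bad = partial_label_loss(scores, [2, 3])  # no candidate outscores type 0, so loss is positive
```

Because only the *best* candidate must win, a mention can carry several noisy candidate types without all of them being forced to score highly.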
“…In addition, Fan et al. [24] presented a novel framework integrating active learning and weakly supervised learning. Nagesh et al. [25] solved the label-assignment problem with integer linear programming (ILP) and improved on the baselines. There are also deep-learning-based methods that use convolutional neural networks for feature modeling and MIL for distant supervision [26].…”
Section: Distant Supervision for Relation Extraction
Abstract: Relation extraction has benefited from distant supervision in recent years with the development of natural language processing techniques and the explosion of available data. However, distant supervision is still greatly limited by the quality of its training data, a consequence of its core motivation: greatly reducing the heavy cost of data annotation. In this paper, we construct an architecture called MIML-sort (Multi-instance Multi-label Learning with Sorting Strategies), built on the well-known MIML framework. Based on MIML-sort, we propose three ranking-based methods for sample selection, with which relation extractors are learned from a subset of the training data. Experiments are set up on the KBP (Knowledge Base Population) corpus, one of the benchmark datasets for distant supervision, which is large and noisy. Compared with previous work, the proposed methods produce considerably better results. Furthermore, the three methods together achieve the best F1 on the official testing set, improving F1 from 27.3% to 29.98%.
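The ranking-based sample selection described above can be sketched generically: score every noisy training instance with some confidence measure, then keep only the top-ranked fraction before (re)training. This is a minimal sketch of the general idea, not MIML-sort's specific sorting strategies; `score_fn` and `keep_ratio` are hypothetical names.

```python
def select_by_rank(instances, score_fn, keep_ratio=0.8):
    """Keep only the top-scoring fraction of noisy training instances.

    score_fn is any confidence measure (e.g. an extractor's probability on
    a training instance); the lowest-scoring, presumably noisiest,
    instances are dropped before retraining.
    """
    ranked = sorted(instances, key=score_fn, reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]

# Toy example: instances are (sentence_id, confidence) pairs.
data = [(1, 0.9), (2, 0.2), (3, 0.7), (4, 0.4), (5, 0.95)]
kept = select_by_rank(data, score_fn=lambda x: x[1], keep_ratio=0.6)
```

Sorting once and truncating keeps the selection step cheap even when the distantly supervised corpus is large.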
“…Most of the aforementioned work used SIL, MIL, or MIML to train classifiers, which set strong baselines in this field. In addition, recent research also includes embedding-based models that recast relation extraction as a translation model of the form h + r ≈ t [22][23][24], nonnegative matrix factorization (NMF) models [8,9] with the characteristic of training and testing jointly, approaches integrating active learning and weakly supervised learning [25], integer linear programming (ILP) [26], and so on.…”
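The translation assumption h + r ≈ t behind such embedding models (TransE-style) can be shown in a few lines: a triple is plausible when the head embedding translated by the relation embedding lands near the tail embedding. The function name and the toy 3-dimensional vectors below are illustrative assumptions, not values from any of the cited papers.

```python
import numpy as np

def transe_score(h, r, t):
    """Score an (h, r, t) triple under the translation assumption h + r ≈ t.

    Lower scores (smaller L2 distance) mean more plausible triples.
    """
    return np.linalg.norm(h + r - t)

# Toy embeddings (illustrative values only).
h = np.array([1.0, 0.0, 0.5])      # head entity
r = np.array([0.0, 1.0, 0.0])      # relation
t_good = np.array([1.0, 1.0, 0.5]) # satisfies h + r = t exactly
t_bad = np.array([0.0, 0.0, 0.0])  # unrelated tail
```

Training then pushes scores of observed triples below those of corrupted (negative) triples, typically with a margin-based loss.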
Section: Distant Supervision for Relation Extraction
“…We test on the KBP dataset, one of the benchmark datasets in this literature, constructed by Surdeanu et al. [4]. The resources are mainly from the TAC KBP 2010 and 2011 slot filling shared tasks [25,26], which contain 183,062 and 3,334 entity pairs for training and testing, respectively. The free text comes from the collection provided by the shared task, which contains approximately 1.5 million documents from a variety of sources, including newswire, blogs, and telephone conversation transcripts.…”
Distant supervision (DS) automatically annotates free text with relation mentions from existing knowledge bases (KBs), providing a way to alleviate the problem of insufficient training data for relation extraction in natural language processing (NLP). However, the heuristic annotation process does not guarantee the correctness of the generated labels, prompting a hot research issue: how to make efficient use of the noisy training data. In this paper, we model two types of biases to reduce noise: (1) bias-dist, to model the relative distance between points (instances) and classes (relation centers); (2) bias-reward, to model the possibility of each heuristically generated label being incorrect. Based on these biases, we propose three noise-tolerant models: MIML-dist, MIML-dist-classify, and MIML-reward, building on top of a state-of-the-art distantly supervised learning algorithm. Experimental evaluations comparing with three landmark methods on the KBP dataset validate the effectiveness of the proposed methods.
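The bias-dist idea, an instance far from its labeled relation's center is more likely to be mislabeled, can be sketched as a simple distance-based down-weighting. This is a rough illustration under our own assumptions (a 1/(1+d) weighting and fixed relation centers), not the paper's exact model; all names and values are hypothetical.

```python
import numpy as np

def distance_bias_weights(instances, labels, centers):
    """Down-weight instances that lie far from their labeled relation center.

    The heuristic label of an instance far from its relation's center in
    feature space is more likely to be wrong, so that instance should
    contribute less to training.
    """
    weights = []
    for x, y in zip(instances, labels):
        d = np.linalg.norm(x - centers[y])
        weights.append(1.0 / (1.0 + d))  # weight near 1 when close to the center
    return np.array(weights)

centers = {"born_in": np.array([0.0, 0.0]), "works_for": np.array([1.0, 1.0])}
instances = [np.array([0.1, 0.0]), np.array([5.0, 5.0])]
labels = ["born_in", "born_in"]
w = distance_bias_weights(instances, labels, centers)
```

The far-away second instance receives a much smaller weight, so a likely-wrong distant-supervision label pulls less on the learned model.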