The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOWITALL's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KNOWITALL extracted over 50,000 facts, but suggested a challenge: How can we improve KNOWITALL's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall. List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on named-entity extraction, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 8-fold increase in recall, while maintaining high precision, and discovered over 10,000 cities missing from the Tipster Gazetteer.
Manually querying search engines in order to accumulate a large body of factual information is a tedious, error-prone process of piecemeal search. Search engines retrieve and rank potentially relevant documents for human perusal, but do not extract facts, assess confidence, or fuse information from multiple documents. This paper introduces KNOWITALL, a system that aims to automate the tedious process of extracting large collections of facts from the web in an autonomous, domain-independent, and scalable manner. The paper describes preliminary experiments in which an instance of KNOWITALL, running for four days on a single machine, was able to automatically extract 54,753 facts. KNOWITALL associates a probability with each fact, enabling it to trade off precision and recall. The paper analyzes KNOWITALL's architecture and reports on lessons learned for the design of large-scale information extraction systems.
The task of identifying synonymous relations and objects, or synonym resolution, is critical for high-quality information extraction. This paper investigates synonym resolution in the context of unsupervised information extraction, where neither hand-tagged training examples nor domain knowledge is available. The paper presents a scalable, fully implemented system that runs in O(KN log N) time in the number of extractions, N, and the maximum number of synonyms per word, K. The system, called Resolver, introduces a probabilistic relational model for predicting whether two strings are co-referential based on the similarity of the assertions containing them. On a set of two million assertions extracted from the Web, Resolver resolves objects with 78% precision and 68% recall, and resolves relations with 90% precision and 35% recall. Several variations of Resolver's probabilistic model are explored, and experiments demonstrate that under appropriate conditions these variations can improve F1 by 5%. An extension to the basic Resolver system allows it to handle polysemous names with 97% precision and 95% recall on a data set from the TREC corpus.
Using 3D terrestrial laser scan (TLS) technology, we have recorded postseismic deformation on and adjacent to the surface rupture formed during the 6th April 2009 L'Aquila normal faulting earthquake (Mw 6.3). Using surface modeling techniques and repeated surveys 8-124 days after the earthquake, we have produced a 4D dataset of postseismic deformation across a 3 × 65 m area at high horizontal spatial resolution. We detected millimetre-scale movements partitioned between discrete surface rupture slip and development of a hangingwall syncline over tens of meters. We interpret the results as the signal of shallow afterslip in the fault zone. We find 52% of the total postseismic hangingwall vertical motion occurs as deformation within 30 m of the surface rupture. The total postseismic vertical motions are approximately 50% of the coseismic motions. We highlight the importance of quantifying partitioned postseismic contributions when applying empirical slip-magnitude datasets to infer palaeoearthquake magnitudes.