We study the following problem: given the name of an ad-hoc concept as well as a few seed entities belonging to the concept, output all entities belonging to it. Since producing the exact set of entities is hard, we focus on returning a ranked list of entities. Previous approaches either use seed entities as the only input, or inherently require negative examples. They suffer from input ambiguity and semantic drift, or are not viable options for ad-hoc tail concepts. In this paper, we propose to leverage the millions of tables on the web for this problem. The core technical challenge is to identify the "exclusive" tables for a concept to prevent semantic drift; existing holistic ranking techniques like personalized PageRank are inadequate for this purpose. We develop novel probabilistic ranking methods that can model a new type of table-entity relationship. Experiments with real-life concepts show that our proposed solution is significantly more effective than applying state-of-the-art set expansion or holistic ranking techniques.
When collecting and combining data from various sources into a data warehouse, ensuring high data quality and consistency becomes a significant, often expensive, challenge. Common data quality problems include inconsistent data conventions amongst sources such as different abbreviations or synonyms; data entry errors such as spelling mistakes; missing, incomplete, outdated or otherwise incorrect attribute values. These data defects generally manifest themselves as foreign-key mismatches and approximately duplicate records, both of which make further data mining and decision support analyses either impossible or suspect. We demonstrate two new data cleansing operators, Fuzzy Lookup and Fuzzy Grouping, which address these problems in a scalable and domain-independent manner. These operators are implemented within Microsoft SQL Server 2005 Integration Services. Our demo will explain their functionality and highlight multiple realworld scenarios in which they can be used to achieve high data quality.
No abstract
Today's record matching infrastructure does not allow a flexible way to account for synonyms such as "Robert" and "Bob" which refer to the same name, and more general forms of string transformations such as abbreviations. We expand the problem of record matching to take such user-defined string transformations as input. These transformations coupled with an underlying similarity function are used to define the similarity between two strings. We demonstrate the effectiveness of this approach via a fuzzy match operation that is used to lookup an input record against a table of records, where we have an additional table of transformations as input. We demonstrate an improvement in record matching quality and efficient retrieval based on our index structure that is cognizant of transformations.
It is well known today that pages on the Web contain a large number of content-rich relational tables. Such tables have been systematically extracted in a number of efforts to empower important applications such as table search and schema discovery. However, a significant fraction of relational tables are not embedded in the standard HTML table tags, and are thus difficult to extract. In particular, a large number of relational tables are known to be in a "list" form, which contains a list of clearly separated rows that are not separated into columns.In this work, we address the important problem of automatically extracting multi-column relational tables from such lists. Our key intuition lies in the simple observation that in correctly-extracted tables, values in the same column are coherent, both at a syntactic and at a semantic level. Using a background corpus of over 100 million tables crawled from the Web, we quantify semantic coherence based on a statistical measure of value co-occurrence in the same column from the corpus. We then model table extraction as a principled optimization problem -we allocate tokens in each row sequentially to a fixed number of columns, such that the sum of coherence across all pairs of values in the same column is maximized. Borrowing ideas from A search and metric distance, we develop an efficient 2-approximation algorithm. We conduct large-scale table extraction experiments using both real Web data and proprietary enterprise spreadsheet data. Our approach considerably outperforms the state-of-the-art approaches in terms of quality, achieving over 90% F-measure across many cases.
No abstract
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.