Truth finding on the deep web

Li, Xian; Dong, Xin Luna; Lyons, Kenneth B.; Meng, Weiyi; Srivastava, Divesh

doi:10.14778/2535568.2448943

Cited by 216 publications

(159 citation statements)

References 16 publications

(31 reference statements)

“…While extremely powerful, there are scenarios where this sampling model does not apply. Most importantly, data sources are not always independent [24]. Furthermore, the number of data sources l has to be large enough to have sufficient overlap between the sources (see Section 6).…”

Section: Data Integration As Sampling Process (mentioning)

confidence: 99%

Estimating the Impact of Unknown Unknowns on Aggregate Query Results

Chung, Mortensen, Binnig, et al. 2018

ACM Trans. Database Syst.


It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results? In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a., unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources.
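The overlap idea in this abstract can be illustrated with a classical species-richness estimator. The sketch below uses the bias-corrected Chao1 formula as a stand-in; it is not claimed to be the estimator developed in the cited paper, and the function name `chao1_estimate` and the sample sources are hypothetical.

```python
from collections import Counter


def chao1_estimate(sources):
    """Estimate the total number of distinct items, seen and unseen.

    `sources` is a list of item collections, one per data source.
    Uses the bias-corrected Chao1 formula:
        N_hat = observed + f1 * (f1 - 1) / (2 * (f2 + 1))
    where f1 / f2 are the numbers of items reported by exactly one / two sources.
    """
    # Count in how many sources each item appears (duplicates within a source ignored).
    counts = Counter(item for src in sources for item in set(src))
    observed = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    f2 = sum(1 for c in counts.values() if c == 2)
    return observed + f1 * (f1 - 1) / (2.0 * (f2 + 1))


# Hypothetical example: three overlapping sources.
sources = [["a", "b", "c"], ["b", "c", "d"], ["c", "e"]]
estimate = chao1_estimate(sources)
print(estimate)  # 5 observed items plus an estimated 1.5 unseen ones
```

Many items seen by only one source (a large `f1`) suggest that more items remain unseen, while heavy overlap pulls the estimate toward the observed count.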

“…These problems have been studied in areas such as knowledge discovery, web personalization, and fact checking [7][8][9][10]. In order to make sense of the data, we must address problems such as the missing or inconsistent data problems while at the same time coping with the sheer amount of data presented to us.…”

Section: Introduction (mentioning)

confidence: 99%

Cross-Checking Multiple Data Sources Using Multiway Join in MapReduce

Afrati, Momani, Stasinopoulos 2017

Scientific Programming


As data sources accumulate information and data size escalates it becomes more and more difficult to maintain the correctness and validity of these datasets. Therefore, tools must emerge to facilitate this daunting task. Fact checking usually involves a large number of data sources that talk about the same thing but we are not sure which holds the correct information or which has any information at all about the query we care for. A join among all or some data sources can guide us through a fact-checking process. However, when we want to perform this join on a distributed computational environment such as MapReduce, it is not obvious how to distribute efficiently the records in the data sources to the reduce tasks in order to join any subset of them in a single MapReduce job. To this end, we propose an efficient approach using the multiway join to cross-check these data sources in a single round.


“…While the truth discovery problem has been studied from different perspectives [12], it remains inefficient. Waguih et al [18] experimentally evaluated the performance of several truth discovery algorithms on three computing nodes on both real-world and synthetic datasets with various configurations, and concluded that most algorithms have efficiency problems.…”

Section: Introduction (mentioning)

confidence: 99%

Approximate Truth Discovery via Problem Scale Reduction

Wang, Sheng, Fang, et al. 2015

Proceedings of the 24th ACM International on Conference on Information and Knowledge Management


Many real-world applications rely on multiple data sources to provide information on their interested items. Due to the noises and uncertainty in data, given a specific item, the information from different sources may conflict. To make reliable decisions based on these data, it is important to identify the trustworthy information by resolving these conflicts, i.e., the truth discovery problem. Current solutions to this problem detect the veracity of each value jointly with the reliability of each source for every data item. In this way, the efficiency of truth discovery is strictly confined by the problem scale, which in turn limits truth discovery algorithms from being applicable on a large scale. To address this issue, we propose an approximate truth discovery approach, which divides sources and values into groups according to a user-specified approximation criterion. The groups are then used for efficient inter-value influence computation to improve the accuracy. Our approach is applicable to most existing truth discovery algorithms. Experiments on real-world datasets show that our approach improves the efficiency compared to existing algorithms while achieving similar or even better accuracy. The scalability is further demonstrated by experiments on large synthetic datasets.
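The joint estimation of value veracity and source reliability that this abstract attributes to current solutions can be sketched as a simple iterative weighted vote. This is a generic illustration, not the approximate method proposed in the cited paper or the algorithm of any other cited work; the function name `discover_truths`, the starting trust of 0.8, and the flight data are all hypothetical.

```python
from collections import defaultdict


def discover_truths(claims, iterations=10, initial_trust=0.8):
    """Alternate between picking the best-supported value per item and
    re-scoring each source by how often it agrees with the current picks.

    `claims` maps (source, item) -> claimed value.
    Returns (truths, trust): chosen value per item, trust score per source.
    """
    sources = {source for source, _ in claims}
    trust = {source: initial_trust for source in sources}  # assumed uniform start
    truths = {}
    for _ in range(iterations):
        # Step 1: each source votes for its value, weighted by its current trust.
        votes = defaultdict(lambda: defaultdict(float))
        for (source, item), value in claims.items():
            votes[item][value] += trust[source]
        truths = {item: max(vals, key=vals.get) for item, vals in votes.items()}
        # Step 2: a source's trust is the fraction of its claims matching the picks.
        for source in sources:
            own = [(item, v) for (s, item), v in claims.items() if s == source]
            agree = sum(1 for item, v in own if truths[item] == v)
            trust[source] = agree / len(own)
    return truths, trust


# Hypothetical example: s1 and s2 agree, s3 conflicts on both items.
claims = {
    ("s1", "flight1"): "10:00", ("s2", "flight1"): "10:00", ("s3", "flight1"): "10:30",
    ("s1", "flight2"): "14:15", ("s2", "flight2"): "14:15", ("s3", "flight2"): "14:45",
}
truths, trust = discover_truths(claims)
print(truths, trust)
```

Every iteration touches every claim, so the cost grows with the number of sources, items, and values. That is the scale dependence the abstract describes.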
