The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of actually integrating the available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation. This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely complete, concise, and consistent data, and highlights its challenges, namely uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relational algebra and SQL. Finally, the article features a comprehensive survey of data integration systems from academia and industry, showing whether and how data fusion is performed in each.
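To make the conflict-resolution idea concrete, here is a minimal Python sketch; it is not from the article, which works with relational operators and SQL, and the record layout and resolver functions are illustrative assumptions. Records describing the same real-world object are merged attribute by attribute: where all sources agree, the unanimous value is kept, and where they conflict, a per-attribute resolution function (majority vote by default, or e.g. longest value) decides.

```python
from collections import Counter

def fuse(records, resolvers):
    """Fuse multiple records describing the same real-world object
    into one complete, consistent representation."""
    fused = {}
    attributes = {a for r in records for a in r}
    for attr in attributes:
        # Collect all non-null values observed for this attribute.
        values = [r[attr] for r in records if r.get(attr) is not None]
        if not values:
            fused[attr] = None            # no source has information
        elif len(set(values)) == 1:
            fused[attr] = values[0]       # consistent: take the single value
        else:
            # Conflict: apply the attribute's resolution function,
            # falling back to a majority vote.
            resolve = resolvers.get(
                attr, lambda vs: Counter(vs).most_common(1)[0][0])
            fused[attr] = resolve(values)
    return fused

# Two records for the same person, with a conflicting city value.
records = [
    {"name": "Ada Lovelace", "city": "London", "born": 1815},
    {"name": "Ada Lovelace", "city": "Marylebone", "born": None},
]
print(fuse(records, {"city": lambda vs: max(vs, key=len)}))
```

The hypothetical `resolvers` mapping stands in for the article's notion of per-attribute conflict resolution strategies; any real system would offer a richer catalogue of such functions.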
Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher, and is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling, beyond traditional profiling tasks and beyond relational databases.
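As an illustration of the simpler, single-column profiling tasks mentioned above, the following Python sketch computes null counts, distinct counts, a crude type inference, and the most frequent value patterns. It is a toy under our own assumptions (the function name and the digit/letter pattern encoding are not from the survey), not a profiling tool.

```python
import re
from collections import Counter

def profile_column(values):
    """Compute simple single-column profiling statistics:
    null count, distinct count, inferred type, frequent value patterns."""
    nulls = sum(v in (None, "") for v in values)
    non_null = [v for v in values if v not in (None, "")]
    distinct = len(set(non_null))

    # Crude type inference: integer if every value parses as an int.
    def is_int(v):
        try:
            int(v)
            return True
        except (TypeError, ValueError):
            return False
    dtype = "integer" if non_null and all(is_int(v) for v in non_null) else "string"

    # Abstract each value into a pattern: digits -> 9, letters -> A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v)))
        for v in non_null
    )
    return {"nulls": nulls, "distinct": distinct, "type": dtype,
            "top_patterns": patterns.most_common(3)}

print(profile_column(["42", "7", None, "42", "abc"]))
```

The harder multi-column tasks the survey classifies (unique column combinations, functional and inclusion dependencies) require search over column sets and are far beyond this per-column sketch.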
Most data integration applications require a matching between the schemas of the respective data sets. We show how the existence of duplicates within these data sets can be exploited to automatically identify matching attributes. We describe an algorithm that first discovers duplicates among data sets with unaligned schemas and then uses these duplicates to perform schema matching between schemas with opaque column names. Discovering duplicates among data sets with unaligned schemas is more difficult than in the usual setting, because it is not clear which fields in one object should be compared with which fields in the other. We have developed a new algorithm that efficiently finds the most likely duplicates in such a setting. Our schema matching algorithm is then able to identify corresponding attributes by comparing data values within those duplicate records. An experimental study on real-world data shows the effectiveness of this approach.
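The following Python sketch illustrates the two-stage idea in miniature; it is an assumption-laden toy, not the paper's algorithm. Records are treated as unordered bags of values, so likely duplicates can be found even though the schemas are unaligned, and the discovered duplicate pairs then vote for attribute correspondences via value similarity.

```python
from difflib import SequenceMatcher
from itertools import product

def sim(a, b):
    """String similarity in [0, 1] (illustrative choice of measure)."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

def find_duplicates(t1, t2, threshold=0.5):
    """Find likely duplicate record pairs across tables with unaligned
    schemas by scoring each record as an unordered bag of values."""
    pairs = []
    for r1, r2 in product(t1, t2):
        # Score each field of r1 against its best-matching field of r2.
        score = sum(max(sim(v1, v2) for v2 in r2) for v1 in r1) / len(r1)
        if score >= threshold:
            pairs.append((r1, r2))
    return pairs

def match_schemas(duplicates):
    """Let the duplicate pairs vote for column correspondences
    based on value similarity, then pick the strongest pairing."""
    votes = {}
    for r1, r2 in duplicates:
        for i, v1 in enumerate(r1):
            for j, v2 in enumerate(r2):
                votes[(i, j)] = votes.get((i, j), 0) + sim(v1, v2)
    return max(votes, key=votes.get) if votes else None

t1 = [("Jane Doe", "NYC"), ("John Roe", "LA")]
t2 = [("New York City", "Jane Doe"), ("Boston", "Ann Poe")]
dups = find_duplicates(t1, t2)
print(dups, match_schemas(dups))   # column 0 of t1 matches column 1 of t2
```

The quadratic all-pairs comparison here is exactly what the paper's efficient duplicate discovery avoids; the sketch only conveys why unaligned schemas make the comparison step ambiguous.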
Many applications process high volumes of streaming data, among them Internet traffic analysis, financial tickers, and transaction log mining. In general, a data stream is an unbounded data set that is produced incrementally over time, rather than being available in full before its processing begins. In this lecture, we give an overview of recent research in stream processing, ranging from answering simple queries on high-speed streams to loading real-time data feeds into a streaming warehouse for off-line analysis. We discuss two types of systems for end-to-end stream processing: Data Stream Management Systems (DSMSs) and Streaming Data Warehouses (SDWs). A traditional database management system typically processes a stream of ad-hoc queries over relatively static data. In contrast, a DSMS evaluates static (long-running) queries on streaming data, making a single pass over the data and using limited working memory. In the first part of this lecture, we discuss research problems in DSMSs, such as continuous query languages, non-blocking query operators that continually react to new data, and continuous query optimization. The second part covers SDWs, which combine the real-time response of a DSMS (by loading new data as soon as they arrive) with a data warehouse's ability to manage terabytes of historical data on secondary storage.
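A minimal Python sketch of the DSMS idea, illustrative only and not any actual continuous query language: a long-running, non-blocking windowed aggregate that emits a result as each tuple arrives, makes a single pass over the stream, and keeps only window-sized working memory.

```python
from collections import deque

def continuous_avg(stream, window=3):
    """A continuous query: emit the average over the last `window`
    items after each arrival. Single pass, memory bounded by the
    window size, never blocks waiting for the stream to end."""
    buf = deque(maxlen=window)   # bounded working memory
    for item in stream:
        buf.append(item)
        yield sum(buf) / len(buf)   # incremental result per new tuple

# Simulate a finite prefix of an unbounded financial ticker stream.
for result in continuous_avg([10, 12, 11, 15, 14]):
    print(result)
```

The generator style mirrors the non-blocking operators discussed in the first part of the lecture: results are produced continuously rather than after the (never-arriving) end of input.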
Handling data is an important task in the real world, since common data are represented and used across all fields, and duplicate records routinely arise in practical scenarios. The proposed work uses two techniques: the Progressive Sorted Neighborhood Method (PSNM) and Progressive Blocking (PB). PSNM delivers output matching the given input: it separates the input into keywords and checks the similarity of candidate records to produce results early. PB filters out irrelevant information; keyword-based indexing and entry-level filtering of the standard input are implemented according to user requirements.
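As a rough illustration of PSNM's progressive behavior, here is a Python sketch under our own assumptions (string records, edit-based similarity, a fixed maximum window): records are sorted by a key, and neighbors are compared at increasing rank distance, so the most likely duplicates are reported first instead of after a full pass.

```python
from difflib import SequenceMatcher

def psnm(records, key=lambda r: r, max_window=4, threshold=0.8):
    """Progressive Sorted Neighborhood Method (sketch): sort the
    records, then compare neighbors at increasing rank distance so
    that the most promising duplicate pairs are emitted first."""
    ranked = sorted(records, key=key)
    for dist in range(1, max_window):           # closest neighbors first
        for i in range(len(ranked) - dist):
            a, b = ranked[i], ranked[i + dist]
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                yield a, b                       # duplicate delivered early

names = ["Jon Smith", "John Smith", "John Smyth", "Mary Major"]
for pair in psnm(names):
    print(pair)
```

Progressive Blocking would instead partition (block) the records, e.g. by keyword index, and schedule the most promising blocks first; the common goal of both techniques is early delivery of results, which this sketch only gestures at.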