Continuous "always-on" monitoring is beneficial for a number of applications, but potentially imposes a high load in terms of communication, storage and power consumption when a large number of variables need to be monitored. We introduce two new filtering techniques, swing filters and slide filters, that represent within a prescribed precision a time-varying numerical signal by a piecewise linear function, consisting of connected line segments for swing filters and (mostly) disconnected line segments for slide filters. We demonstrate the effectiveness of swing and slide filters in terms of their compression power by applying them to a reallife data set plus a variety of synthetic data sets. For nearly all combinations of signal behavior and precision requirements, the proposed techniques outperform the earlier approaches for online filtering in terms of data reduction. The slide filter, in particular, consistently dominates all other filters, with up to twofold improvement over the best of the previous techniques.
Mashup editors, like Yahoo Pipes and IBM Lotus Mashup Maker, allow non-programmer end-users to "mash-up" information
A large number of web pages contain data structured in the form of "lists". Many such lists can be further split into multi-column tables, which can then be used in more semantically meaningful tasks. However, harvesting relational tables from such lists can be a challenging task. The lists are manually generated and hence need not have well defined templates -they have inconsistent delimiters (if any) and often have missing information.We propose a novel technique for extracting tables from lists. The technique is domain-independent and operates in a fully unsupervised manner. We first use multiple sources of information to split individual lines into multiple fields, and then compare the splits across multiple lines to identify and fix incorrect splits and bad alignments. In particular, we exploit a corpus of HTML tables, also extracted from the Web, to identify likely fields and good alignments. For each extracted table, we compute an extraction score that reflects our confidence in the table's quality.We conducted an extensive experimental study using both real web lists and lists derived from tables on the Web. The experiments demonstrate the ability of our technique to extract tables with high accuracy. In addition, we applied our technique on a large sample of about 100,000 lists crawled from the Web. The analysis of the extracted tables have led us to believe that there are likely to be tens of millions of useful and query-able relational tables extractable from lists on the Web.
A large number of web pages contain data structured in the form of "lists". Many such lists can be further split into multi-column tables, which can then be used in more semantically meaningful tasks. However, harvesting relational tables from such lists can be a challenging task. The lists are manually generated and hence need not have well defined templates -they have inconsistent delimiters (if any) and often have missing information.We propose a novel technique for extracting tables from lists. The technique is domain-independent and operates in a fully unsupervised manner. We first use multiple sources of information to split individual lines into multiple fields, and then compare the splits across multiple lines to identify and fix incorrect splits and bad alignments. In particular, we exploit a corpus of HTML tables, also extracted from the Web, to identify likely fields and good alignments. For each extracted table, we compute an extraction score that reflects our confidence in the table's quality.We conducted an extensive experimental study using both real web lists and lists derived from tables on the Web. The experiments demonstrate the ability of our technique to extract tables with high accuracy. In addition, we applied our technique on a large sample of about 100,000 lists crawled from the Web. The analysis of the extracted tables have led us to believe that there are likely to be tens of millions of useful and query-able relational tables extractable from lists on the Web.
No abstract
Abstract-The increasing popularity of social networks, such as Facebook and Orkut, has raised several privacy concerns. Traditional ways of safeguarding privacy of personal information by hiding sensitive attributes are no longer adequate. Research shows that probabilistic classification techniques can effectively infer such private information. The disclosed sensitive information of friends, group affiliations and even participation in activities, such as tagging and commenting, are considered background knowledge in this process. In this paper, we present a privacy protection tool, called Privometer, that measures the amount of sensitive information leakage in a user profile and suggests selfsanitization actions to regulate the amount of leakage. In contrast to previous research, where inference techniques use publicly available profile information, we consider an augmented model where a potentially malicious application installed in the user's friend profiles can access substantially more information. In our model, merely hiding the sensitive information is not sufficient to protect the user privacy. We present an implementation of Privometer in Facebook.
Abstract-Existing techniques for schema matching are classified as either schema-based, instance-based, or a combination of both. In this paper, we define a new class of techniques, called usage-based schema matching. The idea is to exploit information extracted from the query logs to find correspondences between attributes in the schemas to be matched. We propose methods to identify co-occurrence patterns between attributes in addition to other features such as their use in joins and with aggregate functions. Several scoring functions are considered to measure the similarity of the extracted features, and a genetic algorithm is employed to find the highestscore mappings between the two schemas. Our technique is suitable for matching schemas even when their attribute names are opaque. It can further be combined with existing techniques to obtain more accurate results. Our experimental study demonstrates the effectiveness of the proposed approach and the benefit of combining it with other existing approaches.
This paper gives an overview of Turn Data Management Platform (DMP). We explain the purpose of this type of platforms, and show how it is positioned in the current digital advertising ecosystem. We also provide a detailed description of the key components in Turn DMP. These components cover the functions of (1) data ingestion and integration, (2) data warehousing and analytics, and (3) real-time data activation. For all components, we discuss the main technical and research challenges, as well as the alternative design choices. One of the main goals of this paper is to highlight the central role that data management is playing in shaping this fast growing multi-billion dollars industry.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.