Recent years have witnessed an unprecedented proliferation of social media. Every day, people around the globe author millions of blog posts, micro-blog posts, social network status updates, and more. This rich stream of information can be used to identify, on an ongoing basis, emerging stories and events that capture popular attention. Stories can be identified via groups of tightly-coupled real-world entities, namely the people, locations, products, etc., that are involved in the story. The sheer scale and rapid evolution of the data involved necessitate highly efficient techniques for identifying important stories at every point in time. The main challenge in real-time story identification is the maintenance of dense subgraphs (corresponding to groups of tightly-coupled entities) under streaming edge weight updates (resulting from a stream of user-generated content). This is the first work to study the efficient maintenance of dense subgraphs under such streaming edge weight updates. For a wide range of definitions of density, we derive theoretical results regarding the magnitude of change that a single edge weight update can cause. Based on these, we propose a novel algorithm, DYNDENS, which outperforms adaptations of existing techniques to this setting and yields meaningful results. Our approach is validated by a thorough experimental evaluation on large-scale real and synthetic datasets.
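The abstract above does not spell out the DYNDENS algorithm itself, so the following is only a minimal Python sketch of the underlying notion: a weighted entity graph, one common density measure (total induced edge weight divided by the number of vertices), and the effect of a single streaming edge-weight update on a fixed vertex set. The class, method, and entity names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the DYNDENS algorithm, which the abstract does not
# detail): a weighted graph under streaming edge-weight updates, and the
# weighted density of the subgraph induced by a fixed vertex set S.

from collections import defaultdict

class WeightedGraph:
    def __init__(self):
        self.w = defaultdict(float)          # (u, v) with u <= v -> weight

    def _key(self, u, v):
        return (u, v) if u <= v else (v, u)

    def update_edge(self, u, v, new_weight):
        """Apply a streaming edge-weight update; return the weight delta."""
        k = self._key(u, v)
        delta = new_weight - self.w[k]
        self.w[k] = new_weight
        return delta

    def density(self, S):
        """Total induced edge weight of S divided by |S|."""
        S = set(S)
        total = sum(w for (u, v), w in self.w.items() if u in S and v in S)
        return total / len(S) if S else 0.0

if __name__ == "__main__":
    g = WeightedGraph()
    g.update_edge("entity_a", "entity_b", 3.0)
    g.update_edge("entity_a", "entity_c", 2.0)
    S = {"entity_a", "entity_b", "entity_c"}
    before = g.density(S)
    delta = g.update_edge("entity_a", "entity_b", 5.0)   # one streaming update
    after = g.density(S)
    # For this particular density measure, the change is exactly delta / |S|.
    print(before, after, delta / len(S))
```

For this simple measure the effect of one update on a fixed set is easy to bound; the paper's contribution concerns such bounds across a wide range of density definitions and the maintenance of the dense subgraphs themselves, which this sketch does not attempt.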
Privacy is a serious concern when microdata need to be released for ad hoc analyses. The privacy goals of existing privacy protection approaches (e.g., k-anonymity and ℓ-diversity)
We investigate the use of biased sampling according to the density of the data set to speed up general data mining tasks, such as clustering and outlier detection, in large multidimensional data sets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the data set. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, analytically evaluate it, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present a thorough experimental evaluation of the proposed method, applying density-biased sampling to real and synthetic data sets and employing clustering and outlier detection algorithms, thus highlighting the utility of our approach.
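The abstract leaves the specific density estimator and biasing function to the full paper; as a rough illustration only, here is a minimal grid-based sketch in Python in which each point's inclusion probability is a tunable power of its cell's density. The cell size, exponent, and target fraction are placeholder assumptions, not the paper's construction.

```python
# Minimal sketch of density-biased sampling (not the paper's exact estimator):
# estimate local density with a coarse grid, then include each point with a
# probability proportional to a tunable power of its cell's density.
# exponent > 0 biases toward dense regions (useful for clustering);
# exponent < 0 biases toward sparse regions (useful for outlier detection).

import random
from collections import Counter

def density_biased_sample(points, cell_size=1.0, exponent=1.0, target_fraction=0.1):
    # Assign each point to a grid cell and count cell populations.
    def cell(p):
        return tuple(int(coord // cell_size) for coord in p)
    counts = Counter(cell(p) for p in points)

    # Unnormalized inclusion weight: (local density) ** exponent.
    weights = [counts[cell(p)] ** exponent for p in points]
    scale = target_fraction * len(points) / sum(weights)

    return [p for p, w in zip(points, weights) if random.random() < min(1.0, w * scale)]

if __name__ == "__main__":
    random.seed(0)
    cluster = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(1000)]
    outliers = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(20)]
    data = cluster + outliers
    s = density_biased_sample(data, cell_size=0.5, exponent=-0.5, target_fraction=0.05)
    print(len(s), "points sampled, biased toward sparse regions")
```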
Histograms and related synopsis structures are popular techniques for approximating data distributions. They have been successful in query optimization and a variety of applications, including approximate querying, similarity searching, and data mining, to name a few. Histograms were among the earliest synopsis structures proposed and continue to be used widely. The histogram construction problem is to construct the histogram that, within a given space bound, reflects the data distribution most accurately under a given error measure. Histograms are used as quick and easy estimates; thus, a slight loss of accuracy, compared to the optimal histogram under the given error measure, can be offset by fast histogram construction algorithms. A natural question arises in this context: can we find a fast, near-optimal approximation algorithm for the histogram construction problem? In this article, we give the first linear-time (1+ϵ)-factor approximation algorithms (for any ϵ > 0) for a large number of histogram construction problems, including the use of piecewise small-degree polynomials to approximate data, workloads, etc. Several of our algorithms extend to data streams. Using synthetic and real-life data sets, we demonstrate that in many scenarios the approximate histograms are almost identical to optimal histograms in quality and are significantly faster to construct.
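The linear-time (1+ϵ)-approximation algorithms themselves are not described in the abstract; as background, the sketch below shows the exact problem they approximate, using the classic O(n²·B) dynamic program for a B-bucket histogram under sum-of-squared-error, where each bucket is summarized by its mean. The function names and toy data are illustrative.

```python
# Baseline sketch of the exact histogram construction problem (not the
# article's linear-time (1+eps) algorithms): the classic dynamic program
# for a B-bucket histogram minimizing sum of squared error.

def optimal_histogram_sse(data, num_buckets):
    n = len(data)
    # Prefix sums allow O(1) bucket-error queries.
    pre = [0.0] * (n + 1)
    pre_sq = [0.0] * (n + 1)
    for i, x in enumerate(data):
        pre[i + 1] = pre[i] + x
        pre_sq[i + 1] = pre_sq[i] + x * x

    def sse(i, j):        # squared error of one bucket covering data[i:j]
        s, sq, m = pre[j] - pre[i], pre_sq[j] - pre_sq[i], j - i
        return sq - s * s / m

    INF = float("inf")
    # dp[b][j]: best error for data[0:j] using b buckets.
    dp = [[INF] * (n + 1) for _ in range(num_buckets + 1)]
    dp[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            dp[b][j] = min(dp[b - 1][i] + sse(i, j) for i in range(b - 1, j))
    return dp[num_buckets][n]

if __name__ == "__main__":
    data = [1, 1, 2, 9, 10, 10, 11, 50, 51]
    print(optimal_histogram_sse(data, num_buckets=3))
```

The quadratic cost of this exact dynamic program is precisely what motivates the article's near-linear-time approximations.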
Large-scale data analysis lies at the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure can be directly associated with monetary cost. MapReduce has been a popular framework in the context of cloud computing, designed to serve long-running queries (jobs) which can be processed in batch mode. Taking into account that different jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces the overall amount of work, which can lead to reduced monetary charges for utilizing the processing infrastructure. In this article we present a sharing framework tailored to MapReduce, namely MRShare. Our framework, MRShare, transforms a batch of queries into a new batch that will be executed more efficiently, by merging jobs into groups and evaluating each group as a single query. Based on our cost model for MapReduce, we define an optimization problem and we provide a solution that derives the optimal grouping of queries. Given the query grouping, we merge jobs appropriately and submit them to MapReduce for processing. A key property of MRShare is that it is independent of the MapReduce implementation. Experiments with our prototype, built on top of Hadoop, demonstrate the overall effectiveness of our approach. MRShare is primarily designed for handling I/O-intensive queries. However, with the development of high-level languages operating on top of MapReduce, user queries executed in this model become more complex and CPU intensive. Commonly, executed queries can be modeled as evaluating pipelines of CPU-expensive filters over the input stream. Examples of such filters include, but are not limited to, index probes or certain types of joins. In this article we adapt some of the standard techniques for filter ordering used in relational and stream databases, propose their extensions, and implement them through MRAdaptiveFilter, an extension of MRShare for expensive filter ordering tailored to MapReduce, which allows one to handle both single- and batch-query execution modes. We present an experimental evaluation that demonstrates the additional benefits of MRAdaptiveFilter when executing CPU-intensive queries in MRShare.
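The abstract does not detail MRAdaptiveFilter's cost model or its MapReduce integration; the sketch below only illustrates the standard relational filter-ordering rule the article adapts, in which independent, commutative filters are applied in increasing order of rank = cost / (1 − selectivity), so that cheap and highly selective filters run first and drop tuples early. The costs and selectivities are assumed, hypothetical values.

```python
# Minimal sketch of classic rank-based filter ordering (not MRShare or
# MRAdaptiveFilter themselves, whose cost model and MapReduce integration
# are not given in the abstract).

def order_filters(filters):
    """filters: list of (predicate, per_tuple_cost, selectivity), where
    selectivity is the estimated fraction of tuples that PASS the filter."""
    def rank(f):
        _, cost, sel = f
        return cost / max(1e-9, 1.0 - sel)
    return sorted(filters, key=rank)

def apply_pipeline(tuples, filters):
    ordered = order_filters(filters)
    for t in tuples:
        if all(pred(t) for pred, _, _ in ordered):
            yield t

if __name__ == "__main__":
    # Hypothetical CPU-expensive filters with assumed costs and selectivities.
    filters = [
        (lambda t: t % 3 == 0, 5.0, 0.33),   # expensive, moderately selective
        (lambda t: t % 2 == 0, 1.0, 0.50),   # cheap
        (lambda t: t % 7 == 0, 9.0, 0.14),   # very expensive, very selective
    ]
    print(list(apply_pipeline(range(1, 101), filters)))
```

The adaptive aspect described in the article (re-estimating selectivities and costs at run time and re-ordering accordingly) is beyond this static sketch.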