Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data
DOI: 10.1145/93597.98746

Random sampling from hash files

Cited by 27 publications (28 citation statements)
References 7 publications

“…However, all variants of reservoir sampling require overwriting random sample items in R, and such overwrites are expensive in flash (see Section 7). Olken and Rotem [18] present techniques for constructing samples in a database environment. However, in addition to not being designed for flash media, the techniques assume we are sampling from disk-resident, indexed data.…”
Section: Related Work
confidence: 99%
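For context on the overwrite cost noted in the excerpt above: in reservoir sampling (Algorithm R), every item accepted after the reservoir is full replaces a uniformly chosen slot of the sample buffer R in place. A minimal Python sketch, with the stream, k, and variable names chosen here for illustration rather than taken from the cited papers:

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of k items from a stream (Algorithm R)."""
        R = []
        for i, item in enumerate(stream):
            if i < k:
                R.append(item)              # fill the reservoir first
            else:
                j = random.randint(0, i)    # uniform index in [0, i]
                if j < k:
                    R[j] = item             # overwrite a random slot of R
        return R

The assignment R[j] = item is the random in-place overwrite that the excerpt identifies as expensive on flash media.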
“…One possible approach would be to adapt Olken and Rotem's procedure of batch sampling from a hashed file [18]. The basic idea is first to determine how many samples need to be drawn from each bucket (using a multinomial distribution), and then to draw the target number of samples from each bucket with the acceptance/rejection algorithm or the reservoir sampling algorithm.…”
Section: Random Subsampling
confidence: 99%
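A rough sketch of the two-step batch procedure described in the excerpt above, assuming the per-bucket record counts are known in advance. For simplicity, the per-bucket step here just draws uniformly from the bucket contents; Olken and Rotem's method would instead use acceptance/rejection or reservoir sampling when a bucket has to be scanned. Function and variable names are illustrative:

    import random
    import numpy as np

    def batch_sample_hashed_file(buckets, s):
        """Draw s samples (with replacement) from a hashed file in two steps."""
        sizes = np.array([len(b) for b in buckets], dtype=float)
        probs = sizes / sizes.sum()

        # Step 1: allocate the s samples across buckets with a multinomial draw,
        # weighting each bucket by its share of the records.
        counts = np.random.multinomial(s, probs)

        # Step 2: draw the allocated number of samples from each bucket.
        sample = []
        for bucket, c in zip(buckets, counts):
            sample.extend(random.choices(bucket, k=int(c)))
        return sample

Because the multinomial allocation touches each bucket at most once, the buckets can be read sequentially, which is the main attraction of the batch formulation over drawing one record at a time.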
“…Several papers studied techniques for random sampling from B-trees [24,21,20]. Most of these assume the tree is balanced, and are therefore not efficient for highly unbalanced trees, as is the case with the suggestion TRIE.…”
Section: Related Work
confidence: 99%
“…In order to produce random samples from such a materialized view, we can employ iterative or batch sampling techniques [16], [18]- [21] that sample directly from a relational selection predicate, thus avoiding the aforementioned problem of obtaining too few relevant records in the sample. Olken [19] presents a comprehensive analysis and comparison of many such techniques.…”
Section: B. Sampling From Indices
confidence: 99%
“…The classic work in this area (by Olken and his co-authors [16]- [18]) suffers from a key drawback: each record sampled from a database file requires a random disk I/O. At a current rate of around 100 random disk I/Os per second per disk, this means that it is possible to retrieve only 6,000 samples per minute.…”
Section: Introduction
confidence: 99%