Optimal histograms for limiting worst-case error propagation in the size of join results

Ioannidis, Yannis; Christodoulakis, Stavros

doi:10.1145/169725.169708

Cited by 94 publications

(62 citation statements)

References 11 publications

“…The first results that led towards new types of histograms were derived in an effort to obtain statistics that would be optimal in minimizing/containing the propagation of errors in the size of join results [37]. The basic mathematical tools used were borrowed from majorization theory [55].…”

Section: Optimal Sort Parameter (mentioning)

confidence: 99%

The History of Histograms (abridged)

Ioannidis

2003

Proceedings 2003 VLDB Conference

The history of histograms is long and rich, full of detailed information in every step. It includes the course of histograms in different scientific fields, the successes and failures of histograms in approximating and compressing information, their adoption by industry, and solutions that have been given on a great variety of histogram-related problems. In this paper and in the same spirit of the histogram techniques themselves, we compress their entire history (including their "future history" as currently anticipated) in the given/fixed space budget, mostly recording details for the periods, events, and results with the highest (personally-biased) interest. In a limited set of experiments, the semantic distance between the compressed and the full form of the history was found relatively small! Prehistory: The word 'histogram' is of Greek origin, as it is a composite of the words 'isto-s' (ιστός) (= 'mast', also means 'web' but this is not relevant to this discussion) and 'gram-ma' (γράμμα) (= 'something written'). Hence, it should be interpreted as a form of writing consisting of 'masts', i.e., long shapes vertically standing, or something similar. It is not, however, a

“…-V-optimal [8,7,9]: Partition data such that $\sum_{j=1}^{\beta} \sum_{k=1}^{n_j} (f_j - f_{j,k})^2$ is minimized, where $\beta$ is the number of buckets, $n_j$ is the number of entries in the $j$th bucket, $f_j$ is the average frequency of the $j$th bucket, and $f_{j,k}$ is the $k$th frequency of the $j$th bucket.…”

Section: Existing Histogram Techniques (mentioning)

confidence: 99%
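
The V-optimal objective quoted above can be illustrated with a short Python sketch. It is only an illustration of the quoted formula, not code from the cited papers: the function names `v_optimal_error` and `best_partition` are hypothetical, the buckets are assumed to be contiguous runs of a frequency list, and the search is a brute force over all cut points, which is practical only for very small inputs.

```python
from itertools import combinations


def v_optimal_error(buckets):
    """Sum over buckets of squared deviations from the bucket's average frequency."""
    total = 0.0
    for bucket in buckets:
        if not bucket:
            continue  # an empty bucket contributes nothing
        mean = sum(bucket) / len(bucket)  # f_j, the average frequency of the bucket
        total += sum((mean - f) ** 2 for f in bucket)  # (f_j - f_{j,k})^2 terms
    return total


def best_partition(freqs, beta):
    """Brute-force search for the contiguous split into beta buckets with minimal error.

    Assumption for illustration only: buckets are contiguous slices of `freqs`.
    """
    n = len(freqs)
    best = None
    for cuts in combinations(range(1, n), beta - 1):
        bounds = (0, *cuts, n)
        buckets = [freqs[a:b] for a, b in zip(bounds, bounds[1:])]
        err = v_optimal_error(buckets)
        if best is None or err < best[0]:
            best = (err, buckets)
    return best


if __name__ == "__main__":
    print(v_optimal_error([[1, 3], [2]]))
    print(best_partition([1, 1, 5, 5, 9], 3))
```

The exhaustive search is chosen here only because it mirrors the definition directly; it says nothing about the algorithms the cited works actually use.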

“…It has been shown [12] that this technique outperformed those "conventional" histogram techniques [7,8,9,13,17]. To complement the work in [12], in this paper we will propose a novel optimization model for generating linear-spline based histograms.…”

Section: Introduction (mentioning)

confidence: 96%

On Linear-Spline Based Histograms

Zhang

Lin

2002

Advances in Web-Age Information Management

Abstract. Approximation is a very effective paradigm to speed up query processing in large databases. One popular approximation mechanism is data size reduction. There are three reduction techniques: sampling, histograms, and wavelets. Histogram techniques are supported by many commercial database systems, and have been shown to be very effective for approximately processing aggregation queries. In this paper, we investigate optimal models for building histograms based on linear spline techniques. We first propose several novel models. Second, we present efficient algorithms to achieve these proposed optimal models. Our experimental results show that our new techniques can greatly improve the approximation accuracy compared to the existing techniques.

“…Histogram H1 in the earlier table is not serial as frequencies 1 and 3 appear in one bucket and frequency 2 appears in the other, while histogram H2 is. Under various optimality criteria, serial histograms have been shown to be optimal for reducing the worst-case and the average error in equality selection and join queries [IC93, Ioa93, IP95]. Identifying the optimal histogram among all serial ones takes exponential time in the number of buckets. Moreover, since there is usually no order-correlation between attribute values and their frequencies, storage of serial histograms essentially requires a regular index that will lead to the approximate frequency of every individual attribute value.…”

Section: Size-distribution Estimator (mentioning)

confidence: 99%
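
The serial property described in the statement above can be checked with a small Python sketch. It assumes that a histogram is given as a list of buckets, each a list of frequencies, and that "serial" means the frequency ranges of different buckets do not interleave. The function name `is_serial` is hypothetical, and because the table defining H1 and H2 is not part of the excerpt, the H2 used below is an assumed example.

```python
def is_serial(buckets):
    """Return True if no bucket's frequency range interleaves with another's.

    Assumption: ties at a shared boundary value are treated as still serial.
    """
    # Represent each non-empty bucket by its (lowest, highest) frequency.
    ranges = sorted((min(b), max(b)) for b in buckets if b)
    # After sorting, each range must end before (or where) the next one starts.
    return all(prev_hi <= lo for (_, prev_hi), (lo, _) in zip(ranges, ranges[1:]))


if __name__ == "__main__":
    h1 = [[1, 3], [2]]  # frequencies 1 and 3 together, 2 alone (as in the excerpt)
    h2 = [[1, 2], [3]]  # assumed example of a serial grouping
    print(is_serial(h1), is_serial(h2))
```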