2020
DOI: 10.1109/access.2020.2974919
A Survey of Approximate Quantile Computation on Large-Scale Data

Abstract: As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to compute because they are neither mergeable nor incremental. Thus, time and memory limitations do not support their computation on large-scale data. In this paper, we focus on an order statistic, quantiles, and present a comprehensive analysis of studies on approximate quantile computation. Both deterministic algorithms and randomized algorithms…

Cited by 13 publications (10 citation statements) · References 69 publications (105 reference statements)
“…Moreover, the time complexity of the estimation baseline is O(n log n + u²) (i.e., quadratic in the number of unique values u) and the approximation baseline is O(n log² n), whereas Algorithm 1 is O(n log n) for p ∈ {1, 2, ∞} and Algorithm 2 is always O(n log n). 1) Limitations: An important limitation is that, while approximate quantile computation algorithms generally support streaming data use cases [20], the algorithms proposed in this paper do not.…”
Section: A Histogram Specification of Tabular Data Sets
confidence: 99%
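As a rough illustration of the streaming-friendliness the statement above refers to, the sketch below keeps a bounded uniform sample of the stream and answers quantile queries from it. It is a generic reservoir-sampling estimator, not an algorithm from the surveyed paper or the citing paper; the class name, capacity, and seed are invented for this example.

```python
import random

class ReservoirQuantileSketch:
    """Streaming approximate quantile estimator via reservoir sampling.

    Illustrative only: memory stays O(capacity) no matter how many
    items arrive, which is why sketches of this kind suit streams.
    """

    def __init__(self, capacity=1024, seed=0):
        self.capacity = capacity
        self.reservoir = []
        self.count = 0
        self.rng = random.Random(seed)

    def update(self, value):
        self.count += 1
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(value)
        else:
            # Replace a stored item with probability capacity / count,
            # keeping the reservoir a uniform sample of the stream so far.
            j = self.rng.randrange(self.count)
            if j < self.capacity:
                self.reservoir[j] = value

    def quantile(self, q):
        """Approximate q-quantile of everything seen so far."""
        if not self.reservoir:
            raise ValueError("empty sketch")
        data = sorted(self.reservoir)
        idx = min(int(q * len(data)), len(data) - 1)
        return data[idx]

# Example: feed a stream and query the approximate median.
sketch = ReservoirQuantileSketch(capacity=256)
for x in range(100_000):
    sketch.update(x % 1000)
print(sketch.quantile(0.5))  # close to 500 for this stream
```

The accuracy of such a randomized sketch depends on the reservoir size, which is the usual memory/error trade-off in approximate quantile computation.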
“…That said, much like histogram equalization, one of the subset applications of histogram specification is to perform data normalization by quantile transformation. However, current implementations are based on sample quantile estimation [19], followed by interpolating the cumulative distribution function (CDF) estimate given by the quantiles; or by approximate quantile computation [20], followed by evaluating the approximate CDF. Current implementations may be slow or inexact depending on the number of quantiles estimated or on the quantile approximation algorithm employed.…”
Section: Introduction
confidence: 99%
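A minimal sketch of the normalization-by-quantile-transformation route described in the quote, assuming NumPy. Estimate a fixed set of sample quantiles, then interpolate between them to evaluate the estimated CDF; the function name and the choice of 100 quantiles are illustrative, not taken from the cited implementations.

```python
import numpy as np

def quantile_transform_to_uniform(values, n_quantiles=100):
    """Map values to [0, 1] by interpolating an estimated CDF."""
    probs = np.linspace(0.0, 1.0, n_quantiles)
    quantiles = np.quantile(values, probs)   # sample quantile estimates
    # Linear interpolation over the (quantile, prob) pairs evaluates
    # the piecewise-linear CDF estimate at every input value.
    return np.interp(values, quantiles, probs)

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=10_000)
u = quantile_transform_to_uniform(data)
print(u.min(), u.max())  # roughly 0 and 1
```

With this construction, both speed and exactness hinge on how many quantiles are estimated, which is the trade-off the quoted passage points out.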
“…In this case, one would need to estimate the probability of that observation with zero error to get an estimate of the true exposure with also zero error, which clearly requires infinitely many samples. This issue is inherent to several measures that are based on a thresholded statistic of a cumulative distribution function (CDF), such as quantiles (Chen & Zhang, 2020). The next section offers a possible remedy.…”
Section: Estimation of the Exposure
confidence: 99%
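A small, self-contained illustration (my own construction, not from the cited paper) of why a thresholded CDF statistic cannot be estimated with zero error from finitely many samples: the empirical quantile error shrinks with sample size but stays nonzero for any finite sample.

```python
import numpy as np

rng = np.random.default_rng(1)
true_q99 = np.log(100.0)  # 0.99-quantile of Exp(1): solve 1 - exp(-x) = 0.99

for n in (100, 10_000, 1_000_000):
    sample = rng.exponential(size=n)
    est = np.quantile(sample, 0.99)
    # Error decreases with n but never reaches exactly zero.
    print(n, abs(est - true_q99))
```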
“…All the metrics based on the absolute value of the slopes require the additional computation of the absolute values of the successive slopes, but this is an inexpensive operation whose cost grows linearly with the number of observations. On the other hand, the (exact) sorting of the observed values used to compute some metrics, like the IQR, has O(T log T) computational complexity [39]. Further, normalized metrics (i.e., SEM, CoV, and RoD) need to iterate over the time series twice, but this can be done in parallel.…”
Section: Variability Assessment of Irregularly Sampled Data
confidence: 99%
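A brief sketch contrasting the exact sort-based IQR the quote refers to with a selection-based alternative, assuming NumPy; the function names are hypothetical and the quantile convention (lower empirical quantile) is chosen only to keep the two routes comparable.

```python
import numpy as np

def iqr_by_sorting(x):
    """Interquartile range via a full sort: O(T log T) in the series length T."""
    s = np.sort(x)
    t = len(s)
    return s[int(0.75 * (t - 1))] - s[int(0.25 * (t - 1))]

def iqr_by_selection(x):
    """Same quantity via partial selection, avoiding a full sort."""
    t = len(x)
    i1, i3 = int(0.25 * (t - 1)), int(0.75 * (t - 1))
    p = np.partition(x, [i1, i3])   # places the two order statistics correctly
    return p[i3] - p[i1]

x = np.random.default_rng(2).normal(size=100_000)
print(iqr_by_sorting(x), iqr_by_selection(x))  # identical values
```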