2020
DOI: 10.1109/access.2020.2974919
A Survey of Approximate Quantile Computation on Large-Scale Data

Abstract: As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to compute because they are neither mergeable nor incremental. Thus, time and memory limitations do not support their computation on large-scale data. In this paper, we focus on an order statistic, quantiles, and present a comprehensive analysis of studies on approximate quantile computation. Both deterministic algorithms and randomized algorithms…

Cited by 13 publications (10 citation statements) · References 69 publications (105 reference statements)
“…Moreover, the time complexity of the estimation baseline is O(n log n + u²) (i.e., quadratic in the number of unique values u) and the approximation baseline is O(n log² n), whereas Algorithm 1 is O(n log n) for p ∈ {1, 2, ∞} and Algorithm 2 is always O(n log n). 1) Limitations: An important limitation is that, while approximate quantile computation algorithms generally support streaming data use cases [20], the algorithms proposed in this paper do not.…”
Section: A Histogram Specification of Tabular Data Sets
confidence: 99%
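As a rough illustration of the streaming-friendliness the statement above refers to, the sketch below keeps a bounded uniform sample of the stream and answers quantile queries from it. It is a generic reservoir-sampling estimator, not an algorithm from the surveyed paper or the citing paper; the class name, capacity, and seed are invented for this example.

```python
import random

class ReservoirQuantileSketch:
    """Streaming approximate quantile estimator via reservoir sampling.

    Illustrative only: memory stays O(capacity) no matter how many
    items arrive, which is why sketches of this kind suit streams.
    """

    def __init__(self, capacity=1024, seed=0):
        self.capacity = capacity
        self.reservoir = []
        self.count = 0
        self.rng = random.Random(seed)

    def update(self, value):
        self.count += 1
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(value)
        else:
            # Replace a stored item with probability capacity / count,
            # keeping the reservoir a uniform sample of the stream so far.
            j = self.rng.randrange(self.count)
            if j < self.capacity:
                self.reservoir[j] = value

    def quantile(self, q):
        """Approximate q-quantile of everything seen so far."""
        if not self.reservoir:
            raise ValueError("empty sketch")
        data = sorted(self.reservoir)
        idx = min(int(q * len(data)), len(data) - 1)
        return data[idx]

# Example: feed a stream and query the approximate median.
sketch = ReservoirQuantileSketch(capacity=256)
for x in range(100_000):
    sketch.update(x % 1000)
print(sketch.quantile(0.5))  # close to 500 for this stream
```

The accuracy of such a randomized sketch depends on the reservoir size, which is the usual memory/error trade-off in approximate quantile computation.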
“…That said, much like histogram equalization, one of the subset applications of histogram specification is to perform data normalization by quantile transformation. However, current implementations are based on sample quantile estimation [19], followed by interpolating the cumulative distribution function (CDF) estimate given by the quantiles; or by approximate quantile computation [20], followed by evaluating the approximate CDF. Current implementations may be slow or inexact depending on the number of quantiles estimated or on the quantile approximation algorithm employed.…”
Section: Introduction
confidence: 99%
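A minimal sketch of the normalization-by-quantile-transformation route described in the quote, assuming NumPy. Estimate a fixed set of sample quantiles, then interpolate between them to evaluate the estimated CDF; the function name and the choice of 100 quantiles are illustrative, not taken from the cited implementations.

```python
import numpy as np

def quantile_transform_to_uniform(values, n_quantiles=100):
    """Map values to [0, 1] by interpolating an estimated CDF."""
    probs = np.linspace(0.0, 1.0, n_quantiles)
    quantiles = np.quantile(values, probs)   # sample quantile estimates
    # Linear interpolation over the (quantile, prob) pairs evaluates
    # the piecewise-linear CDF estimate at every input value.
    return np.interp(values, quantiles, probs)

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=10_000)
u = quantile_transform_to_uniform(data)
print(u.min(), u.max())  # roughly 0 and 1
```

With this construction, both speed and exactness hinge on how many quantiles are estimated, which is the trade-off the quoted passage points out.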
“…In this case, one would need to estimate the probability of that observation with zero error to get an estimate of the true exposure with also zero error, which clearly requires infinitely many samples. This issue is inherent to several measures that are based on a thresholded statistic of a cumulative distribution function (CDF), such as quantiles (Chen & Zhang, 2020). The next section offers a possible remedy.…”
Section: Estimation of the Exposure
confidence: 99%
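A small, self-contained illustration (my own construction, not from the cited paper) of why a thresholded CDF statistic cannot be estimated with zero error from finitely many samples: the empirical quantile error shrinks with sample size but stays nonzero for any finite sample.

```python
import numpy as np

rng = np.random.default_rng(1)
true_q99 = np.log(100.0)  # 0.99-quantile of Exp(1): solve 1 - exp(-x) = 0.99

for n in (100, 10_000, 1_000_000):
    sample = rng.exponential(size=n)
    est = np.quantile(sample, 0.99)
    # Error decreases with n but never reaches exactly zero.
    print(n, abs(est - true_q99))
```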
“…All the metrics based on the absolute value of the slopes require the additional computation of the absolute values of the successive slopes, but this is an inexpensive operation whose cost grows linearly with the number of observations. On the other hand, the (exact) sorting of the observed values used to compute some metrics, like the IQR, has O(T log T) computational complexity [39]. Further, normalized metrics (i.e., SEM, CoV, and RoD) need to iterate over the time series twice, but this can be done in parallel.…”
Section: Variability Assessment of Irregularly Sampled Data
confidence: 99%
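A brief sketch contrasting the exact sort-based IQR the quote refers to with a selection-based alternative, assuming NumPy; the function names are hypothetical and the quantile convention (lower empirical quantile) is chosen only to keep the two routes comparable.

```python
import numpy as np

def iqr_by_sorting(x):
    """Interquartile range via a full sort: O(T log T) in the series length T."""
    s = np.sort(x)
    t = len(s)
    return s[int(0.75 * (t - 1))] - s[int(0.25 * (t - 1))]

def iqr_by_selection(x):
    """Same quantity via partial selection, avoiding a full sort."""
    t = len(x)
    i1, i3 = int(0.25 * (t - 1)), int(0.75 * (t - 1))
    p = np.partition(x, [i1, i3])   # places the two order statistics correctly
    return p[i3] - p[i1]

x = np.random.default_rng(2).normal(size=100_000)
print(iqr_by_sorting(x), iqr_by_selection(x))  # identical values
```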