We design a new distribution over m × n matrices S so that, for any fixed n × d matrix A of rank r, with probability at least 9/10, ∥SAx∥_2 = (1 ± ε)∥Ax∥_2 simultaneously for all x ∈ R^d. Here, m is bounded by a polynomial in rε^{-1}, and the parameter ε ∈ (0, 1]. Such a matrix S is called a subspace embedding. Furthermore, SA can be computed in O(nnz(A)) time, where nnz(A) is the number of nonzero entries of A. This improves over all previous subspace embeddings, for which computing SA required at least Ω(nd log d) time. We call these S sparse embedding matrices. Using our sparse embedding matrices, we obtain the fastest known algorithms for overconstrained least-squares regression, low-rank approximation, approximating all leverage scores, and ℓ_p regression. More specifically, let b be an n × 1 vector, ε > 0 a small enough value, and let k, p ≥ 1 be integers. Our results include the following.
— Regression: The regression problem is to find a d × 1 vector x′ for which ∥Ax′ − b∥_p ≤ (1 + ε) min_x ∥Ax − b∥_p. For the Euclidean case p = 2, we obtain an algorithm running in O(nnz(A)) + Õ(d^3 ε^{-2}) time, and another in O(nnz(A) log(1/ε)) + Õ(d^3 log(1/ε)) time. (Here, Õ(f) = f · log^{O(1)}(f).) More generally, for p ∈ [1, ∞), we obtain an algorithm running in O(nnz(A) log n) + O(rε^{-1})^C time, for a fixed constant C.
— Low-rank approximation: We give an algorithm to obtain a rank-k matrix Â_k such that ∥A − Â_k∥_F ≤ (1 + ε)∥A − A_k∥_F, where A_k is the best rank-k approximation to A. (That is, A_k is the output of principal components analysis, produced by a truncated singular value decomposition, useful for latent semantic indexing and many other statistical problems.) Our algorithm runs in O(nnz(A)) + Õ(nk^2 ε^{-4} + k^3 ε^{-5}) time.
— Leverage scores: We give an algorithm to estimate the leverage scores of A, up to a constant factor, in O(nnz(A) log n) + Õ(r^3) time.
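The sparse embedding construction lends itself to a very short implementation: each column of S has a single ±1 entry in a uniformly random row, so SA can be accumulated in one pass over the nonzeros of A. The sketch-and-solve least-squares example below is a minimal illustration of this idea; the sketch size m and all constants are illustrative choices, not the ones prescribed by the analysis above.

```python
# Minimal sketch of a sparse (CountSketch-style) embedding applied to least squares.
# The target dimension m below is an illustrative choice, not the paper's bound.
import numpy as np
import scipy.sparse as sp

def sparse_embedding(n, m, rng):
    """Return an m x n sparse embedding matrix: one random +/-1 per column."""
    rows = rng.integers(0, m, size=n)        # h(j): the row holding column j's nonzero
    signs = rng.choice([-1.0, 1.0], size=n)  # sigma(j): random sign
    return sp.csr_matrix((signs, (rows, np.arange(n))), shape=(m, n))

def sketched_least_squares(A, b, m, rng):
    """Approximate argmin_x ||Ax - b||_2 by solving the sketched problem."""
    n = A.shape[0]
    S = sparse_embedding(n, m, rng)
    SA, Sb = S @ A, S @ b                    # computable in one pass over nonzeros
    x, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x

rng = np.random.default_rng(0)
n, d = 100_000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_sketch = sketched_least_squares(A, b, m=2_000, rng=rng)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
# Ratio of residuals; close to 1 when the sketch preserves the column span of [A b].
print(np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_exact - b))
```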
We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has applications to query optimization, Internet routing, network topology, and data mining. For a stream of indices in {1, ..., n}, our algorithm computes a (1 ± ε)-approximation using an optimal O(ε^{-2} + log n) bits of space with 2/3 success probability, where 0 < ε < 1 is given. This probability can be amplified by independent repetition. Furthermore, our algorithm processes each stream update in O(1) worst-case time, and can report an estimate at any point midstream in O(1) worst-case time, thus settling both the space and time complexities simultaneously. We also give an algorithm to estimate the Hamming norm of a stream, a generalization of the number of distinct elements, which is useful in data cleaning, packet tracing, and database auditing. Our algorithm uses nearly optimal space, and has optimal O(1) update and reporting times.
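The optimal algorithm referred to above is intricate; as a simpler point of comparison, the following k-minimum-values (KMV) estimator, a classical hashing-based distinct-count sketch that is not the algorithm of this abstract, illustrates the general idea of estimating cardinality from a small summary.

```python
# Classical KMV distinct-count estimator (illustration only, not the optimal algorithm
# from the abstract): hash each item to [0, 1), keep the k smallest hash values, and
# estimate the number of distinct elements as (k - 1) / v_k, where v_k is the largest kept value.
import heapq
import random

class KMVSketch:
    def __init__(self, k, seed=0):
        self.k = k
        self.salt = random.Random(seed).getrandbits(64)
        self.heap = []          # max-heap (stored negated) of the k smallest hashes
        self.seen = set()       # hash values currently stored, to skip duplicates

    def _hash01(self, item):
        h = hash((self.salt, item)) & ((1 << 61) - 1)
        return h / float(1 << 61)

    def update(self, item):
        v = self._hash01(item)
        if v in self.seen:
            return
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -v)
            self.seen.add(v)
        elif v < -self.heap[0]:
            evicted = -heapq.heappushpop(self.heap, -v)
            self.seen.discard(evicted)
            self.seen.add(v)

    def estimate(self):
        if len(self.heap) < self.k:
            return float(len(self.heap))     # fewer than k distinct hashes seen: exact
        return (self.k - 1) / (-self.heap[0])

sketch = KMVSketch(k=1024)
for i in range(1_000_000):
    sketch.update(i % 50_000)                # stream with 50,000 distinct elements
print(sketch.estimate())
```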
We give near-optimal space bounds in the streaming model for linear algebra problems that include estimation of matrix products, linear regression, low-rank approximation, and approximation of matrix rank. In the streaming model, sketches of input matrices are maintained under updates of matrix entries; we prove results for turnstile updates, given in an arbitrary order. We give the first lower bounds known for the space needed by the sketches, for a given estimation error ε. We sharpen prior upper bounds, with respect to combinations of space, failure probability, and number of passes. The sketch we use for matrix A is simply S^T A, where S is a sign matrix. Our results include the following upper and lower bounds on the bits of space needed for 1-pass algorithms. Here A is an n × d matrix, B is an n × d′ matrix, and c := d + d′. These results are given for fixed failure probability; for failure probability δ > 0, the upper bounds require a factor of log(1/δ) more space. We assume the inputs have integer entries specified by O(log(nc)) or O(log(nd)) bits.
1. (Matrix Product) Output a matrix C with ∥A^T B − C∥_F ≤ ε∥A∥_F ∥B∥_F. We show that Θ(cε^{-2} log(nc)) space is needed.
2. (Linear Regression) For an n × 1 column vector b, output a vector x so that ∥Ax − b∥_2 ≤ (1 + ε) min_{x′} ∥Ax′ − b∥_2. We show that Θ(d^2 ε^{-1} log(nd)) space is needed.
3. (Rank-k Approximation) Find a matrix Ã_k of rank no more than k, so that ∥A − Ã_k∥_F ≤ (1 + ε)∥A − A_k∥_F, where A_k is the best rank-k approximation to A. Our lower bound is Ω(kε^{-1}(n + d) log(nd)) space, and we give a one-pass algorithm matching this when A is given row-wise or column-wise. For general updates, we give a one-pass algorithm needing O(kε^{-2}(n + dε^{-2}) log(nd)) space.
We also give upper and lower bounds for algorithms using multiple passes, and a sketching analog of the CUR decomposition.
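As a toy illustration of the sign-matrix sketch S^T A (an assumed simplification that ignores the paper's space accounting and derandomization), the snippet below maintains sketches of A and B under turnstile entry updates and estimates A^T B as (S^T A)^T (S^T B)/m, which is unbiased because E[S S^T] = m·I.

```python
# Illustrative sign-matrix sketch for matrix-product estimation under turnstile updates.
# S is stored explicitly for clarity; a streaming algorithm would generate its entries by hashing.
import numpy as np

class SignSketch:
    """Maintain the m x d sketch S^T A under turnstile updates (i, j, delta) to A."""
    def __init__(self, S, d):
        self.S = S                                 # n x m sign matrix, shared by all sketches
        self.sketch = np.zeros((S.shape[1], d))    # S^T A

    def update(self, i, j, delta):                 # A[i, j] += delta
        self.sketch[:, j] += delta * self.S[i, :]

def estimate_product(skA, skB):
    """Estimate A^T B as (S^T A)^T (S^T B) / m; unbiased since E[S S^T] = m I."""
    m = skA.S.shape[1]
    return skA.sketch.T @ skB.sketch / m

rng = np.random.default_rng(1)
n, d, dp, m = 5_000, 10, 8, 400
S = rng.choice([-1.0, 1.0], size=(n, m))
A, B = rng.standard_normal((n, d)), rng.standard_normal((n, dp))
skA, skB = SignSketch(S, d), SignSketch(S, dp)
for (i, j), v in np.ndenumerate(A):
    skA.update(i, j, v)
for (i, j), v in np.ndenumerate(B):
    skB.update(i, j, v)
err = np.linalg.norm(estimate_product(skA, skB) - A.T @ B, 'fro')
# Relative error decays like 1/sqrt(m).
print(err / (np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro')))
```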
We give a 1-pass Õ(m^{1−2/k})-space algorithm for computing the k-th frequency moment of a data stream for any real k > 2. Together with the lower bounds of [1, 2, 4], this resolves the main problem left open by Alon et al. in 1996 [1]. Our algorithm also works for streams with deletions and thus gives an Õ(m^{1−2/p})-space algorithm for the L_p difference problem for any p > 2. This essentially matches the known Ω(m^{1−2/p−o(1)}) lower bound of [12, 2]. Finally, the update time of our algorithm is Õ(1).
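For context, the quantity being approximated is the k-th frequency moment F_k = Σ_i f_i^k, where f_i is the number of occurrences of item i in the stream. The snippet below implements the classical single-sample AMS estimator for F_k (not the algorithm of this abstract), which is unbiased but needs many independent copies to control its variance when k > 2.

```python
# Classical AMS estimator for F_k (illustration only): sample a uniformly random
# stream position via reservoir sampling, count occurrences r of the sampled item
# from that position onward, and output m * (r^k - (r-1)^k).
import random
from collections import Counter

def ams_estimate_fk(stream, k, rng):
    """Unbiased single-sample estimate of F_k."""
    candidate, r = None, 0
    for t, item in enumerate(stream, start=1):
        if rng.random() < 1.0 / t:        # keep position t with probability 1/t
            candidate, r = item, 0
        if item == candidate:
            r += 1                        # occurrences from the sampled position on
    m = t
    return m * (r ** k - (r - 1) ** k)

rng = random.Random(0)
stream = [rng.randrange(100) for _ in range(50_000)]
k = 3
estimates = [ams_estimate_fk(stream, k, random.Random(s)) for s in range(200)]
exact = sum(f ** k for f in Counter(stream).values())
print(sum(estimates) / len(estimates), exact)
```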
We design a new distribution over poly(rε^{-1}) × n matrices S so that, for any fixed n × d matrix A of rank r, with probability at least 9/10, ∥SAx∥_2 = (1 ± ε)∥Ax∥_2 simultaneously for all x ∈ R^d. Such a matrix S is called a subspace embedding. Furthermore, SA can be computed in O(nnz(A)) time, where nnz(A) is the number of non-zero entries of A. This improves over all previous subspace embeddings, which required at least Ω(nd log d) time to achieve this property. We call our matrices S sparse embedding matrices. Using our sparse embedding matrices, we obtain the fastest known algorithms for overconstrained least-squares regression, low-rank approximation, approximating all leverage scores, and ℓ_p-regression:
• to output an x′ for which ∥Ax′ − b∥_2 ≤ (1 + ε) min_x ∥Ax − b∥_2, for an n × d matrix A and an n × 1 column vector b, we obtain an algorithm running in O(nnz(A)) + O(d^3 ε^{-2}) time, and another in O(nnz(A) log(1/ε)) + Õ(d^3 log(1/ε)) time. (Here Õ(f) = f · log^{O(1)}(f).)
• to obtain a decomposition of an n × n matrix A into a product of an n × k matrix L, a k × k diagonal matrix D, and an n × k matrix W, for which ∥A − LDW^T∥_F ≤ (1 + ε)∥A − A_k∥_F, where A_k is the best rank-k approximation, our algorithm runs in O(nnz(A)) + Õ(nk^2 ε^{-4} + k^3 ε^{-5}) time.
• to output an approximation to all leverage scores of an n × d input matrix A simultaneously, with constant relative error, our algorithms run in O(nnz(A) log n) + Õ(r^3) time.
• to output an x′ for which ∥Ax′ − b∥_p ≤ (1 + ε) min_x ∥Ax − b∥_p, for an n × d matrix A and an n × 1 column vector b, we obtain an algorithm running in O(nnz(A) log n) + poly(rε^{-1}) time, for any constant 1 ≤ p < ∞.
We optimize the polynomial factors in the above stated running times, and show various tradeoffs. Finally, we provide preliminary experimental results which suggest that our algorithms are of interest in practice.
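For the leverage-score result, a standard sketch-based recipe (shown below as an assumed illustration, not necessarily the exact variant analyzed here) is to form SA with a sparse embedding, take its QR factorization SA = QR, and approximate the leverage scores by the squared row norms of A R^{-1} G for a small Johnson-Lindenstrauss matrix G.

```python
# Illustrative sketch-based leverage-score approximation; sketch sizes m and t are
# arbitrary demo choices, not the bounds from the abstract.
import numpy as np
import scipy.sparse as sp

def sparse_embedding(n, m, rng):
    """m x n CountSketch-style embedding: one random +/-1 per column."""
    rows = rng.integers(0, m, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    return sp.csr_matrix((signs, (rows, np.arange(n))), shape=(m, n))

def approx_leverage_scores(A, m, t, rng):
    """Approximate leverage scores as squared row norms of A R^{-1} G."""
    n, d = A.shape
    S = sparse_embedding(n, m, rng)
    _, R = np.linalg.qr(S @ A)                    # SA = QR, R is d x d
    G = rng.standard_normal((d, t)) / np.sqrt(t)  # Johnson-Lindenstrauss matrix
    Z = A @ np.linalg.solve(R, G)                 # n x t product
    return np.einsum('ij,ij->i', Z, Z)

rng = np.random.default_rng(2)
n, d = 20_000, 15
A = rng.standard_normal((n, d))
approx = approx_leverage_scores(A, m=2_000, t=64, rng=rng)
U, _, _ = np.linalg.svd(A, full_matrices=False)   # exact scores: row norms of U squared
exact = np.einsum('ij,ij->i', U, U)
print(np.max(np.abs(approx - exact) / exact))     # worst-case relative error
```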
We settle the 1-pass space complexity of (1 ± ε)-approximating the L_p norm, for real p with 1 ≤ p ≤ 2, of a length-n vector updated in a length-m stream with updates to its coordinates. We assume the updates are integers in the range [−M, M]. In particular, we show the space required is Θ(ε^{-2} log(mM) + log log n) bits. Our result also holds for 0 < p < 1; although L_p is not a norm in this case, it remains a well-defined function. Our upper bound improves upon previous algorithms of [Indyk, JACM '06] and [Li, SODA '08]. This improvement comes from showing an improved derandomization of the L_p sketch of Indyk by using k-wise independence for small k, as opposed to using the heavy hammer of a generic pseudorandom generator against space-bounded computation such as Nisan's PRG. Our lower bound improves upon previous work of [Alon-Matias-Szegedy, JCSS '99] and [Woodruff, SODA '04], and is based on showing a direct sum property for the 1-way communication of the gap-Hamming problem.
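The upper bound above derandomizes Indyk's p-stable sketch; the fully random version of that sketch for p = 1 is short enough to state directly. The code below is an illustration with i.i.d. Cauchy entries, not the limited-independence construction of this paper: sketch y = Sx and report the median of |y_i|, which concentrates around ∥x∥_1 because each y_i is distributed as ∥x∥_1 times a standard Cauchy variable, whose absolute value has median 1.

```python
# Illustrative L1 sketch via 1-stable (Cauchy) projections, with full randomness.
import numpy as np

class L1Sketch:
    def __init__(self, n, m, rng):
        self.S = rng.standard_cauchy((m, n))   # p-stable projection for p = 1
        self.y = np.zeros(m)                   # sketch y = S x

    def update(self, i, delta):                # coordinate i of x changes by delta
        self.y += delta * self.S[:, i]

    def estimate(self):
        return np.median(np.abs(self.y))       # median(|Cauchy|) = 1, so this ~ ||x||_1

rng = np.random.default_rng(3)
n, m = 10_000, 400
x = np.zeros(n)
sketch = L1Sketch(n, m, rng)
for _ in range(50_000):                        # random turnstile updates
    i, delta = rng.integers(n), rng.integers(-5, 6)
    x[i] += delta
    sketch.update(i, delta)
print(sketch.estimate(), np.abs(x).sum())
```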
We study the tradeoff between the statistical error and communication cost of distributed statistical estimation problems in high dimensions. In the distributed sparse Gaussian mean estimation problem, each of the m machines receives n data points from a d-dimensional Gaussian distribution with unknown mean θ, which is promised to be k-sparse. The machines communicate by message passing and aim to estimate the mean θ. We provide a tight (up to logarithmic factors) tradeoff between the estimation error and the number of bits communicated between the machines. This directly leads to a lower bound for the distributed sparse linear regression problem: to achieve the statistical minimax error, the total communication is at least Ω(min{n, d}m), where n is the number of observations that each machine receives and d is the ambient dimension. These lower bounds improve upon [Sha14, SD15] by allowing a multi-round, iterative communication model. We also give the first optimal simultaneous protocol in the dense case for mean estimation. As our main technique, we prove a distributed data processing inequality, as a generalization of usual data processing inequalities, which may be of independent interest and useful for other problems.
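As a baseline for the dense setting, the toy simulation below runs the naive simultaneous protocol in which every machine sends its (unquantized) local sample mean and the server averages them; it is only meant to make the setup concrete and does not model the bit-level communication accounting, sparsity exploitation, or the optimal protocols of this work.

```python
# Naive simultaneous protocol for distributed Gaussian mean estimation (baseline only).
import numpy as np

rng = np.random.default_rng(4)
m, n, d, k = 20, 50, 200, 5                    # machines, samples per machine, dimension, sparsity

theta = np.zeros(d)                            # unknown k-sparse mean
theta[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)

# Each machine draws n points from N(theta, I_d) and sends its local mean.
local_means = [theta + rng.standard_normal((n, d)).mean(axis=0) for _ in range(m)]
estimate = np.mean(local_means, axis=0)        # server averages the m messages

print(np.linalg.norm(estimate - theta))        # error on the order of sqrt(d / (m * n))
```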