$\ell_1$ Regression using Lewis Weights Preconditioning and Stochastic Gradient Descent

Durfee, David; Lai, Kevin A.; Sawlani, Saurabh

doi:10.48550/arxiv.1708.07821

Cited by 1 publication

(4 citation statements)

References 14 publications

(20 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…On the other hand, the complexity of first-order methods usually has dependences on various parameters of the input matrix A, which can be unbounded in the worst case. Fortunately, recent developments in ℓ 1 regression [32] show how to precondition the matrix A by simply doing an ℓ 1 Lewis weights sampling, and then rotating the matrix appropriately. By carefully combining this preconditioning procedure with Accelerated Gradient Descent, we obtain an algorithm for (1+ε)-approximate ℓ 1 regression with communication complexity O(sd 3 L/ε) in the coordinator model, which shows it is indeed possible to improve the ε dependence for ℓ 1 regression.…”

Section: Linear Regressionmentioning

confidence: 99%

“…), we use Lemma 29 in [32], which states that with constant probability, the leverage scores of SA satisfy τ i (SA) = O(d/N ) for all i. Since leverage scores are invariant under change of basis (see, e.g., Section 2.4 in [65]), we have for all i,…”

Section: Use the Protocol Inmentioning

confidence: 99%

“…where we assume SAR −1 x − Sb 1 is G-Lipschitz continuous and the initial solution x satisfies x − x * 2 2 ≤ Θ. Since SAR −1 is approximately isotropic row-bounded and the initial vector x is the optimal solution to the ℓ 2 regression problem min x SAR −1 x − Sb 2 2 , Lemma 19 in [32] shows that x − x * 2 ≤ d/n Ax * − b 1 . Furthermore, Lemma 15 in [32] shows that G ≤ √ nd.…”

Section: Use the Protocol Inmentioning

confidence: 99%

“…Since SAR −1 is approximately isotropic row-bounded and the initial vector x is the optimal solution to the ℓ 2 regression problem min x SAR −1 x − Sb 2 2 , Lemma 19 in [32] shows that x − x * 2 ≤ d/n Ax * − b 1 . Furthermore, Lemma 15 in [32] shows that G ≤ √ nd. By setting δ = ε Ax * − b 1 , we can calculate a (1 + ε)-approximate solution to the ℓ 1 regression problem using O(d/ε) full gradient calculations.…”

Section: Use the Protocol Inmentioning

confidence: 99%

See 3 more Smart Citations

The Communication Complexity of Optimization

Vempala¹,

Wang

Woodruff

2020

Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms

View full text Add to dashboard Cite

We consider the communication complexity of a number of distributed optimization problems. We start with the problem of solving a linear system. Suppose there is a coordinator together with s servers P 1 , . . . , P s , the i-th of which holds a subset A (i) x = b (i) of n i constraints of a linear system in d variables, and the coordinator would like to output an x ∈ R d for which A (i) x = b (i) for i = 1, . . . , s. We assume each coefficient of each constraint is specified using L bits. We first resolve the randomized and deterministic communication complexity in the point-to-point model of communication, showing it is Θ(d 2 L + sd) and Θ(sd 2 L), respectively. We obtain similar results for the blackboard communication model. As a result of independent interest, we show the probability a random matrix with integer entries inWhen there is no solution to the linear system, a natural alternative is to find the solution minimizing the ℓ p loss, which is the ℓ p regression problem. While this problem has been studied, we give improved upper or lower bounds for every value of p ≥ 1. One takeaway message is that sampling and sketching techniques, which are commonly used in earlier work on distributed optimization, are neither optimal in the dependence on d nor on the dependence on the approximation ε, thus motivating new techniques from optimization to solve these problems.Towards this end, we consider the communication complexity of optimization tasks which generalize linear systems, such as linear, semidefinite, and convex programming. For linear programming, we first resolve the communication complexity when d is constant, showing it is Θ(sL) in the point-to-point model. For general d and in the point-to-point model, we show an O(sd 3 L) upper bound and an Ω(d 2 L + sd) lower bound. In fact, we show if one perturbs the coefficients randomly by numbers as small as 2 −Θ(L) , then the upper bound is O(sd 2 L)+poly(dL), and so this bound holds for almost all linear programs. Our study motivates understanding the bit complexity of linear programming, which is related to the running time in the unit cost RAM model with words of O(log(nd)) bits, and we give the fastest known algorithms for linear programming in this model. * Santosh S. Vempala was supported in part by NSF awards CCF-1717349 and DMS-1839323. Ruosong Wang and David P. Woodruff were supported in part by Office of Naval Research (ONR) grant N00014-18-1-2562. Part of this work was done while the authors were visiting the Simons Institute for the Theory of Computing.Large-scale optimization problems often cannot fit into a single machine, and so they are distributed across a number s of machines. That is, each of servers P 1 , . . . , P s may hold a subset of constraints that it is given locally as input, and the goal of the servers is to communicate with each other to find a solution satisfying all constraints. Since communication is often a bottleneck in distributed computation, the goal of the servers is to communicate as little as possible.There are sever...

show abstract

Section: Linear Regressionmentioning

confidence: 99%

Section: Use the Protocol Inmentioning

confidence: 99%