Proceedings of the 24th International Conference on Machine Learning, 2007
DOI: 10.1145/1273496.1273501

Scalable training of L1-regularized log-linear models

Cited by 378 publications (367 citation statements)
References 9 publications
“…We present this strategy as a two phase method, composed of an active set identification phase using an infinitesimal line search, and a subspace minimization phase that utilizes the Hessian subsampling technique. We present numerical results for a sparse version of the speech recognition problem, and we have shown that our algorithm is able to outperform the OWL algorithm presented in [2], both in terms of sparsity and objective value.…”
Section: Final Remarks (citation type: mentioning)
Confidence: 95%
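
The excerpt above compares a two-phase active-set method against the OWL algorithm on the L1-regularized log-linear objective of the cited paper. Purely as orientation, here is a minimal sketch, assuming a logistic-loss objective with an L1 penalty, of how such a problem can be minimized with plain proximal gradient descent (ISTA) and soft-thresholding. It is not the OWL-QN method nor the two-phase method described in the excerpt, and the data, step size, and regularization weight lam are illustrative assumptions.

# Minimal sketch: L1-regularized logistic regression via proximal gradient (ISTA).
# Not the method of the cited paper or the excerpt; data and hyperparameters are illustrative.
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def l1_logistic_prox_grad(X, y, lam=0.1, step=0.1, iters=500):
    """Minimize (1/n) * sum_i log(1 + exp(-y_i x_i^T w)) + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)
        # Gradient of the smooth logistic-loss term only.
        grad = -(X.T @ (y * (1.0 / (1.0 + np.exp(margins))))) / n
        # Gradient step on the smooth part, then soft-threshold for the L1 part.
        w = soft_threshold(w - step * grad, step * lam)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))
    true_w = np.zeros(50)
    true_w[:5] = 1.0  # sparse ground truth
    y = np.sign(X @ true_w + 0.1 * rng.standard_normal(200))
    w = l1_logistic_prox_grad(X, y)
    print("nonzero weights:", np.count_nonzero(np.abs(w) > 1e-8))

The soft-thresholding step is what drives many weights to exactly zero, which is the kind of sparsity the excerpt uses to compare its method against the OWL baseline.
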
“…These methods must, however, perform a vast number of iterations before an appreciable improvement in the objective is obtained, and due to the sequential nature of these iterations, it can be difficult to parallelize them; see [23,1,10] and the references therein. On the other hand, batch (or mini-batch) algorithms can easily exploit parallelism in the function and gradient evaluation, and are able to yield high accuracy in the solution of the optimization problem [30,2,31], if so desired. Motivated by the potential of function/gradient parallelism, the sole focus of this paper is on batch and mini-batch methods.…”
Section: Preliminaries (citation type: mentioning)
Confidence: 99%
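
The excerpt above points out that batch and mini-batch methods can parallelize the function and gradient evaluation. A minimal sketch of that idea, assuming a logistic-loss gradient and a simple thread pool; the sharding scheme and worker count are illustrative, not taken from the cited papers.

# Minimal sketch: a batch gradient is a sum of independent per-shard terms,
# so shards can be evaluated concurrently and the partial results summed.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def shard_grad(args):
    """Gradient of the logistic loss on one shard of the batch (unnormalized)."""
    X_s, y_s, w = args
    margins = y_s * (X_s @ w)
    return -(X_s.T @ (y_s * (1.0 / (1.0 + np.exp(margins)))))

def batch_gradient(X, y, w, n_shards=4):
    """Compute per-shard gradients in parallel, sum them, and normalize by n."""
    shards = zip(np.array_split(X, n_shards), np.array_split(y, n_shards))
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partials = pool.map(shard_grad, [(X_s, y_s, w) for X_s, y_s in shards])
    return sum(partials) / X.shape[0]

Because the per-shard terms are independent, the same decomposition applies whether the shards are threads on one machine or workers in a distributed setting.
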