Cho‐Jui Hsieh scite author profile

In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) is one of the most popular tools to deal with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1-and L2-loss functions. The proposed method is simple and reaches an -accurate solution in O(log(1/ )) iterations. Experiments indicate that our method is much faster than state of the art solvers such as Pegasos, TRON, SVM perf , and a recent primal coordinate descent implementation.

show abstract

VisualBERT: A Simple and Performant Baseline for Vision and Language

Li¹,

Yatskar²,

Yin³

et al. 2019

Preprint

458

545

View full text Add to dashboard Cite

AutoZOOM: Autoencoder-Based Zeroth Order Optimization Method for Attacking Black-Box Neural Networks

Ting

Chen³

et al. 2019

AAAI

294

267

View full text Add to dashboard Cite

Recent studies have shown that adversarial examples in stateof-the-art image classifiers trained by deep neural networks (DNN) can be easily generated when the target model is transparent to an attacker, known as the white-box setting. However, when attacking a deployed machine learning service, one can only acquire the input-output correspondences of the target model; this is the so-called black-box attack setting. The major drawback of existing black-box attacks is the need for excessive model queries, which may give a false sense of model robustness due to inefficient query designs. To bridge this gap, we propose a generic framework for query-efficient blackbox attacks. Our framework, AutoZOOM, which is short for Autoencoder-based Zeroth Order Optimization Method, has two novel building blocks towards efficient black-box attacks: (i) an adaptive random gradient estimation strategy to balance query counts and distortion, and (ii) an autoencoder that is either trained offline with unlabeled data or a bilinear resizing operation for attack acceleration. Experimental results suggest that, by applying AutoZOOM to a state-of-the-art black-box attack (ZOO), a significant reduction in model queries can be achieved without sacrificing the attack success rate and the visual quality of the resulting adversarial examples. In particular, when compared to the standard ZOO method, AutoZOOM can consistently reduce the mean query counts in finding successful adversarial examples (or reaching the same distortion level) by at least 93% on MNIST, CIFAR-10 and ImageNet datasets, leading to novel insights on adversarial robustness.

show abstract

Towards Robust Neural Networks via Random Self-ensemble

et al. 2018

View full text Add to dashboard Cite

Recent studies have revealed the vulnerability of deep neural networks: A small adversarial perturbation that is imperceptible to human can easily make a well-trained deep neural network misclassify. This makes it unsafe to apply neural networks in security-critical applications. In this paper, we propose a new defense algorithm called Random Self-Ensemble (RSE) by combining two important concepts: randomness and ensemble. To protect a targeted model, RSE adds random noise layers to the neural network to prevent the strong gradient-based attacks, and ensembles the prediction over random noises to stabilize the performance. We show that our algorithm is equivalent to ensemble an infinite number of noisy models f without any additional memory overhead, and the proposed training procedure based on noisy stochastic gradient descent can ensure the ensemble model has a good predictive capability. Our algorithm significantly outperforms previous defense techniques on real data sets. For instance, on CIFAR-10 with VGG network (which has 92% accuracy without any attack), under the strong C&W attack within a certain distortion tolerance, the accuracy of unprotected model drops to less than 10%, the best previous defense technique has 48% accuracy, while our method still has 86% prediction accuracy under the same level of attack. Finally, our method is simple and easy to integrate into any neural network.

show abstract

ImageNet Training in Minutes

Zhang²,

et al. 2018

View full text Add to dashboard Cite

Since its creation, the ImageNet-1k benchmark set has played a significant role as a benchmark for ascertaining the accuracy of different deep neural net (DNN) models on the classification problem. Moreover, in recent years it has also served as the principal benchmark for assessing different approaches to DNN training. Finishing a 90-epoch ImageNet-1k training with ResNet-50 on a NVIDIA M40 GPU takes 14 days. This training requires 10 18 single precision operations in total. On the other hand, the world's current fastest supercomputer can finish 2 × 10 17 single precision operations per second. If we can make full use of the computing capability of the fastest supercomputer for DNN training, we should be able to finish the 90-epoch ResNet-50 training in five seconds. Over the last two years, a number of researchers have focused on closing this significant performance gap through scaling DNN training to larger numbers of processors. Most successful approaches to scaling ImageNet training have used the synchronous stochastic gradient descent. However, to scale synchronous stochastic gradient descent one must also increase the batch size used in each iteration. Thus, for many researchers, the focus on scaling DNN training has translated into a focus on developing training algorithms that enable increasing the batch size in data-parallel synchronous stochastic gradient descent without losing accuracy over a fixed number of epochs. As a result, we have seen the batch size and number of processors successfully utilized increase from 1K batch size on 128 processors to 8K batch size on 256 processors over the last two years. The recently published LARS algorithm increased batch size further to 32K for some DNN models. Following up on this work, we wished to confirm that LARS could be used to further scale the number of processors efficiently used in DNN training and, and as a result, further reduce the total training time. In this paper we present the results of this investigation: using LARS we efficiently utilized 1024 CPUs to finish the 100-epoch ImageNet training with AlexNet in 11 minutes with 58.6% accuracy (batch size = 32K), and we utilized 2048 KNLs to finish the 90-epoch ImageNet training with ResNet-50 in 20 minutes without losing accuracy (batch size = 32K). State-of-the-art ImageNet training speed with ResNet-50 is 74.9% top-1 test accuracy in 15 minutes (Akiba, Suzuki, and Fukuda 2017). We got 74.9% top-1 test accuracy in 64 epochs, which only needs 14 minutes. Furthermore, when the batch size is above 16K, our accuracy using LARS is much higher than Facebooks corresponding batch sizes (Figure 1). Our code is available upon request. 15 mins Our version 32K 74.9% 14 mins arXiv:1709.05011v10 [cs.CV]

show abstract

Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems

Hsieh

et al. 2012

191

153

View full text Add to dashboard Cite

Abstract-Matrix factorization, when the matrix has missing values, has become one of the leading techniques for recommender systems. To handle web-scale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) are two popular approaches to compute matrix factorization. There has been a recent flurry of activity to parallelize these algorithms. However, due to the cubic time complexity in the target rank, ALS is not scalable to large-scale datasets. On the other hand, SGD conducts efficient updates but usually suffers from slow convergence that is sensitive to the parameters. Coordinate descent, a classical optimization approach, has been used for many other large-scale problems, but its application to matrix factorization for recommender systems has not been explored thoroughly. In this paper, we show that coordinate descent based methods have a more efficient update rule compared to ALS, and are faster and have more stable convergence than SGD. We study different update sequences and propose the CCD++ algorithm, which updates rank-one factors one by one. In addition, CCD++ can be easily parallelized on both multi-core and distributed systems. We empirically show that CCD++ is much faster than ALS and SGD in both settings. As an example, on a synthetic dataset with 2 billion ratings, CCD++ is 4 times faster than both SGD and ALS using a distributed system with 20 machines.

show abstract

Fast coordinate descent methods with variable selection for non-negative matrix factorization

2011

View full text Add to dashboard Cite

Nonnegative Matrix Factorization (NMF) is an effective dimension reduction method for non-negative dyadic data, and has proven to be useful in many areas, such as text mining, bioinformatics and image processing. NMF is usually formulated as a constrained non-convex optimization problem, and many algorithms have been developed for solving it. Recently, a coordinate descent method, called FastHals [3], has been proposed to solve least squares NMF and is regarded as one of the state-of-the-art techniques for the problem. In this paper, we first show that FastHals has an inefficiency in that it uses a cyclic coordinate descent scheme and thus, performs unneeded descent steps on unimportant variables. We then present a variable selection scheme that uses the gradient of the objective function to arrive at a new coordinate descent method. Our new method is considerably faster in practice and we show that it has theoretical convergence guarantees. Moreover when the solution is sparse, as is often the case in real applications, our new method benefits by selecting important variables to update more often, thus resulting in higher speed. As an example, on a text dataset RCV1, our method is 7 times faster than FastHals, and more than 15 times faster when the sparsity is increased by adding an L1 penalty. We also develop new coordinate descent methods when error in NMF is measured by KLdivergence by applying the Newton method to solve the one-variable sub-problems. Experiments indicate that our algorithm for minimizing the KL-divergence is faster than the Lee & Seung multiplicative rule by a factor of 10 on the CBCL image dataset.

show abstract

Low rank modeling of signed networks

2012

View full text Add to dashboard Cite

Trust networks, where people leave trust and distrust feedback, are becoming increasingly common. These networks may be regarded as signed graphs, where a positive edge weight captures the degree of trust while a negative edge weight captures the degree of distrust. Analysis of such signed networks has become an increasingly important research topic. One important analysis task is that of sign inference, i.e., infer unknown (or future) trust or distrust relationships given a partially observed signed network. Most state-of-the-art approaches consider the notion of structural balance in signed networks, building inference algorithms based on information about links, triads, and cycles in the network. In this paper, we first show that the notion of weak structural balance in signed networks naturally leads to a global low-rank model for the network. Under such a model, the sign inference problem can be formulated as a low-rank matrix completion problem. We show that we can perfectly recover missing relationships, under certain conditions, using state-of-the-art matrix completion algorithms. We also propose the use of a low-rank matrix factorization approach with generalized loss functions as a practical method for sign inference -this approach yields high accuracy while being scalable to large signed networks, for instance, we show that this analysis can be performed on a synthetic graph with 1.1 million nodes and 120 million edges in 10 minutes. We further show that the low-rank model can be used for other analysis tasks on signed networks, such as user segmentation through signed graph clustering, with theoretical guarantees. Experiments on synthetic as well as real data show that our low rank model substantially improves accuracy of sign inference as well as clustering. As an example, on the largest real dataset available to us (Epinions data with 130K nodes and 840K edges), our matrix factorization approach yields 94.6% accuracy on the sign inference task as compared to 90.8% accuracy using a state-of-the-art cyclebased method -moreover, our method runs in 40 seconds as compared to 10,000 seconds for the cycle-based method.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Cho‐Jui Hsieh

A dual coordinate descent method for large-scale linear SVM

VisualBERT: A Simple and Performant Baseline for Vision and Language

AutoZOOM: Autoencoder-Based Zeroth Order Optimization Method for Attacking Black-Box Neural Networks

Towards Robust Neural Networks via Random Self-ensemble

ImageNet Training in Minutes

Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems

Fast coordinate descent methods with variable selection for non-negative matrix factorization

Low rank modeling of signed networks

Contact Info

Product

Resources

About