2017
DOI: 10.48550/arxiv.1711.00501
Preprint

Learning One-hidden-layer Neural Networks with Landscape Design

Rong Ge,
Jason D. Lee,
Tengyu Ma

Abstract: We consider the problem of learning a one-hidden-layer neural network: we assume the input x ∈ R^d is drawn from a Gaussian distribution and the label y = a^⊤σ(Bx) + ξ, where a is a nonnegative vector in R^m with m ≤ d, B ∈ R^{m×d} is a full-rank weight matrix, and ξ is a noise vector. We first give an analytic formula for the population risk of the standard squared loss and demonstrate that it implicitly attempts to decompose a sequence of low-rank tensors simultaneously. Inspired by the formula, we design a non-convex…
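To make the model concrete, here is a minimal sketch in Python of the data-generating process and the squared loss described above. The ReLU activation, the dimensions, and the noise scale are illustrative assumptions; the abstract only constrains σ, a, and B as stated.

```python
# Minimal sketch of the generative model from the abstract:
# y = a^T sigma(B x) + xi, with Gaussian input x.
# The choice of ReLU for sigma, the dimensions, and the noise
# scale are illustrative assumptions, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 4                               # input dim d, hidden width m (m <= d)

a = rng.uniform(0.1, 1.0, size=m)          # nonnegative output weights
B = rng.standard_normal((m, d))            # full-rank weight matrix (a.s.)

def sample(n, noise=0.01):
    """Draw n (x, y) pairs from the one-hidden-layer model."""
    x = rng.standard_normal((n, d))        # Gaussian input
    y = np.maximum(x @ B.T, 0.0) @ a       # a^T ReLU(B x)
    return x, y + noise * rng.standard_normal(n)

def squared_loss(a_hat, B_hat, x, y):
    """Empirical squared loss; approaches the population risk as n grows."""
    pred = np.maximum(x @ B_hat.T, 0.0) @ a_hat
    return np.mean((y - pred) ** 2)

x, y = sample(100_000)
print(squared_loss(a, B, x, y))            # ~ noise variance (0.01**2) at the truth
```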

Cited by 57 publications (117 citation statements). References 7 publications.
“…Nevertheless, for many modern ML models such as CNNs, (P-DALE) remains a non-convex program in θ. And while there is overwhelming theoretical and empirical evidence that stochastic gradient-based algorithms yield good local minimizers for such overparametrized problems [35][36][37][38][39], the fact remains that solving (P-DALE) requires us to evaluate an expectation with respect to λ, which is challenging due to the fact that µ_n and γ_n are not known a priori. In the remainder of this section, we propose a practical algorithm to solve (P-DALE) based on the approximation discussed in Section 3.…”
Section: Dual Robust Learning Algorithm
confidence: 99%
“…What is more, maximizing over δ in the definition of adv is a severely underparametrized problem as opposed to the minimization over θ in (P-RO). It therefore does not enjoy the same benign optimization landscape [35][36][37][38][39]. Additionally, note that there is no guarantee that this alternating optimization technique converges.…”
Section: A2 Sampling Vs Optimizing Perturbations
confidence: 99%
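For intuition, the alternating scheme the quote refers to can be sketched generically as follows. The toy least-squares model, the shared perturbation δ, the budget ε, and all step sizes are illustrative assumptions, not details from either paper; as the quote notes, this alternation carries no convergence guarantee.

```python
# Generic sketch of alternating min-max: an inner (approximate)
# maximization over a low-dimensional perturbation delta, nested inside
# an outer minimization over theta. Everything here is an illustrative
# assumption, and the alternation is not guaranteed to converge.
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

theta = np.zeros(d)
eps, eta_in, eta_out = 0.1, 0.5, 0.05

for step in range(300):
    # Inner loop: projected gradient ascent over a shared perturbation
    # delta in R^d -- far fewer parameters than a typical theta.
    delta = np.zeros(d)
    for _ in range(5):
        r = (X + delta) @ theta - y          # residuals at perturbed inputs
        delta += eta_in * r.mean() * theta   # ascent step on the loss
        delta = np.clip(delta, -eps, eps)    # project onto the eps-box
    # Outer step: gradient descent on theta at the perturbed inputs.
    r = (X + delta) @ theta - y
    theta -= eta_out * (X + delta).T @ r / n

print(np.round(theta, 3))                    # approximate robust solution
```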
“…Instead, many theoretical works focus on finding a local minimum instead of a global one, because recent works (both empirical and theoretical) have suggested that local minima are nearly as good as global minima for a significant number of well-studied machine learning problems; see e.g. [4,11,13,14,16,17]. On the other hand, saddle points are major obstacles for solving these problems, not only because they are ubiquitous in high-dimensional settings where the directions for escaping may be few (see e.g.…
Section: Introduction
confidence: 99%
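The obstacle that saddle points pose for first-order methods can be seen on a toy quadratic: gradient descent initialized on the stable manifold converges to the saddle, while an arbitrarily small perturbation escapes it. The example below is purely illustrative and is not taken from the cited works.

```python
# Toy illustration of why saddle points obstruct gradient descent:
# f(u, v) = u^2 - v^2 has a saddle at the origin. Started exactly on
# the u-axis, gradient descent converges to the saddle; any tiny
# perturbation in v escapes along the negative-curvature direction.
import numpy as np

def grad(p):
    u, v = p
    return np.array([2 * u, -2 * v])

def run(p0, lr=0.1, steps=100):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p -= lr * grad(p)
    return p

print(run([1.0, 0.0]))    # -> ~[0, 0]: stuck at the saddle
print(run([1.0, 1e-8]))   # -> v grows geometrically: escapes the saddle
```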
“…Towards mitigating the degradation, we identify a critical issue in CQL: solely regularizing the critic is insufficient for multiple agents to learn good policies for coordination in the offline setting. The primary cause is that first-order policy gradient methods are prone to local optima [14,36,46], saddle points [52,54], or noisy gradient estimates [51]. As a result, this can lead to uncoordinated suboptimal learning behavior because the actor cannot leverage the global information in the critic well.…”
Section: Introduction
confidence: 99%