A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks (NNs) have made a theory of learning dynamics elusive. In this work, we show that for wide NNs the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian NNs and Gaussian processes (GPs), gradient-based training of wide NNs with a squared loss produces test set predictions drawn from a GP with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
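To make the linearization concrete, here is a minimal sketch (the toy 1-D regression data, the tanh MLP in NTK parameterization, the learning rate, and all variable names are illustrative assumptions, not details from the paper) that trains a small network and its first-order Taylor expansion around the initial parameters with the same gradient-descent updates and compares their predictions:

```python
# Sketch: compare a network trained by gradient descent with its linearization
# (first-order Taylor expansion around the initial parameters).
import jax
import jax.numpy as jnp

WIDTHS = (1, 512, 512, 1)   # illustrative: 1-D input, two hidden layers, scalar output
LEARNING_RATE = 0.5
STEPS = 1000

def init_params(key):
    params = []
    for n_in, n_out in zip(WIDTHS[:-1], WIDTHS[1:]):
        key, sub = jax.random.split(key)
        # NTK-style parameterization: standard normal weights scaled by 1/sqrt(fan_in).
        params.append(jax.random.normal(sub, (n_in, n_out)) / jnp.sqrt(n_in))
    return params

def f(params, x):
    h = x
    for w in params[:-1]:
        h = jnp.tanh(h @ w)
    return h @ params[-1]

def f_lin(params, params0, x):
    # f_lin(theta, x) = f(theta0, x) + J_theta f(theta0, x) . (theta - theta0)
    delta = jax.tree_util.tree_map(lambda a, b: a - b, params, params0)
    y0, jvp = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
    return y0 + jvp

def sq_loss(pred, y):
    return 0.5 * jnp.mean((pred - y) ** 2)

def sgd(params, loss_fn, steps):
    grad_fn = jax.jit(jax.grad(loss_fn))
    for _ in range(steps):
        g = grad_fn(params)
        params = jax.tree_util.tree_map(lambda p, gi: p - LEARNING_RATE * gi, params, g)
    return params

key = jax.random.PRNGKey(0)
x = jnp.linspace(-1.0, 1.0, 32).reshape(-1, 1)
y = jnp.sin(3.0 * x)

params0 = init_params(key)
params_nn = sgd(params0, lambda p: sq_loss(f(p, x), y), STEPS)
params_lin = sgd(params0, lambda p: sq_loss(f_lin(p, params0, x), y), STEPS)

print("max |f - f_lin| on the training set:",
      float(jnp.max(jnp.abs(f(params_nn, x) - f_lin(params_lin, params0, x)))))
```

For sufficiently wide hidden layers the two sets of predictions should agree closely, the finite-width counterpart of the exact infinite-width statement above.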
In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enable training at these depths, it has remained unclear whether such specialized architecture designs are truly necessary to train deep CNNs. In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix. These conditions require that the convolution operator be an orthogonal transformation in the sense that it is norm-preserving. We present an algorithm for generating such random initial orthogonal convolution kernels and demonstrate empirically that they enable efficient training of extremely deep architectures.
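The abstract does not spell out the kernel-generation algorithm; the sketch below shows one construction consistent with its description, often called a delta-orthogonal initialization (NumPy, 2-D kernels, and the requirement that the number of output channels be at least the number of input channels are assumptions of this sketch, not claims about the paper's exact procedure):

```python
# Sketch: a delta-orthogonal convolution initializer — a random orthogonal
# block at the spatial center of the kernel, zeros at every other tap.
import numpy as np

def delta_orthogonal_kernel(ksize, c_in, c_out, rng=None):
    """Return a (ksize, ksize, c_in, c_out) kernel whose induced convolution
    preserves norms on suitably padded inputs (assumes c_out >= c_in)."""
    assert c_out >= c_in, "need at least as many output as input channels"
    rng = np.random.default_rng() if rng is None else rng
    # Haar-random orthogonal matrix via QR decomposition of a Gaussian matrix.
    a = rng.standard_normal((c_out, c_out))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))                  # sign fix so q is Haar-distributed
    kernel = np.zeros((ksize, ksize, c_in, c_out))
    center = (ksize - 1) // 2
    kernel[center, center, :, :] = q[:c_in, :]   # orthogonal block at the center tap
    return kernel
```

Placing the orthogonal block at the central tap makes a stride-1, suitably padded convolution act as a norm-preserving map, which is the property the dynamical-isometry analysis requires of the input-output Jacobian at initialization.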
We study the bilinear Hilbert transform and bilinear maximal functions associated to polynomial curves and obtain uniform $L^r$ estimates for $r > \frac{d-1}{d}$; this index is sharp up to the endpoint.

In [9], it was shown that $T^*$ can be extended to a bounded operator from $\ell^2 \times \ell^2$ to $\ell^r$ provided that $r > 1$. Because there is no transference principle available, it is not clear that Theorem 2 implies any boundedness of $T^*$. A very interesting question is to establish an $\ell^2 \times \ell^2 \to \ell^1$ estimate for $T^*$, from which the pointwise convergence of the associated non-conventional dynamical averages would follow. The circle method and/or the large sieve method are expected to be relevant to this problem. The method used in this paper essentially works for more general curves on nilpotent groups; we shall not pursue this in this article. Besides the generalisation to more general curves, it is natural to ask whether the results extend to the multilinear and/or higher-dimensional cases (see [15]).

Main structure of the proof. Unfortunately, the proof of Theorem 1 has to be quite technical, since it involves uniform bounds. In this section we sketch the proof of Theorem 1 to present its main ideas. Let $P(t) = \sum_{k=1}^{d} a_k t^k$ and let $N$ be a sufficiently large positive integer, say $N > 2^{100d!}$. For $l = 1, \dots, d$, we define the sets $J_l(N)$, and then set (2.2) $J_{\mathrm{bad}}(N) := \bigcup_{l=1}^{d} J_l(N)$ and $J_{\mathrm{good}}(N) := \mathbb{Z} \setminus J_{\mathrm{bad}}(N)$. The $J_l(N)$ may be empty for some $l$ and can be thought of, essentially, as the collection of dyadic scales at which the $l$-th term of the polynomial $P$ dominates all the other terms. Hence, whenever $j \in J_l(N)$ and $|t| \sim 2^{-j}$, the polynomial $P$ behaves almost the same as the monomial $a_l t^l$. The following lemma asserts that the cardinality of $J_{\mathrm{good}}(N)$ is majorized by a constant independent of the coefficients of $P(t)$; this uniform upper bound is crucial in our proof of Theorem 1.

Lemma 1. The cardinality of $J_{\mathrm{good}}(N)$ is bounded above by a quantity depending only on $N$ and the degree $d$ of $P(t)$; more precisely, $\#J_{\mathrm{good}}(N) \le (2(N+2d)+1)\,d(d-1) + (2N-1)$.

Proof. The proof is a simple application of the pigeonhole principle. Let $|a_l| = 2^{b_l}$. Observe that if $j \in J_{\mathrm{good}}(N)$, then there exist two integers $1 \le m \ne n \le d$ such that (2.3) holds; write $J_{\mathrm{good}}(N,m,n)$ for the set of such $j$. The cardinality of $\{\,j : |j| < N\,\}$ equals $2N-1$, and by (2.3) the cardinality of $J_{\mathrm{good}}(N,m,n)$ is at most $2(N+2d)+1$. Since the number of distinct pairs $(m,n)$ is $d(d-1)$, the conclusion of the lemma follows.

Remark 1. The reader may notice that the upper bound in the above lemma is far from sharp; however, it suffices for our application. The quantity $N$ depends only on $d$, $p_1$ and $p_2$, but is chosen sufficiently large for technical reasons.

Based on Lemma 1, we will decompose the operator $H_{\Gamma_P}(f,g)$ into two components. First, write $\frac{1}{t} = \sum_{j \in \mathbb{Z}} \rho_j(t)$, where each $\rho_j(t)$ is a smooth odd function. For the BMO estimate (6.29), let $J$ be a dyadic interval of length $2^{-J}$.
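For orientation, $H_{\Gamma_P}$ above denotes the bilinear Hilbert transform along the curve $\Gamma_P(t) = (t, P(t))$; the display below records its standard form together with one standard choice of the smooth dyadic decomposition of the kernel $1/t$ (the normalization is illustrative and need not match the paper's exact conventions):

```latex
% Bilinear Hilbert transform along the polynomial curve \Gamma_P(t) = (t, P(t)),
% with P(t) = \sum_{k=1}^{d} a_k t^k, and a standard smooth dyadic
% decomposition of the kernel 1/t (illustrative normalization).
\[
  H_{\Gamma_P}(f,g)(x) \;=\; \mathrm{p.v.}\int_{\mathbb{R}} f(x-t)\, g\bigl(x-P(t)\bigr)\,\frac{dt}{t},
\]
\[
  \frac{1}{t} \;=\; \sum_{j\in\mathbb{Z}} \rho_j(t),
  \qquad \rho_j(t) := 2^{j}\rho\bigl(2^{j}t\bigr),
\]
% where \rho is a smooth odd function supported in {1/2 <= |t| <= 2}, so that
% each \rho_j is supported where |t| ~ 2^{-j}, matching the dyadic scales j
% singled out by the sets J_l(N) in the decomposition above.
```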
We obtain sharp estimates for certain trilinear oscillatory integrals, extending Phong and Stein's seminal result to a trilinear setting. This result partially answers a question raised by Christ, Li, Tao and Thiele concerning sharp estimates for certain multilinear oscillatory integrals. The method in this paper relies on a self-contained algorithm of resolution of singularities in $\mathbb{R}^2$, which may be of independent interest.
In this paper we establish sharp estimates (up to logarithmic losses) for the multilinear oscillatory integral operator studied by Phong, Stein, and Sturm [14] and by Carbery and Wright [2] on any product $\prod_{j=1}^{d} L^{p_j}(\mathbb{R})$ with each $p_j \ge 2$, expanding the known results for this operator well beyond the previous range $\sum_{j=1}^{d} p_j^{-1} = d - 1$. Our theorem assumes a second-order nondegeneracy condition of Varchenko type and, as a corollary, reproduces Varchenko's theorem and implies Fourier decay estimates for measures with smooth density on degenerate hypersurfaces in $\mathbb{R}^d$.
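For orientation, the multilinear form in question has the following general shape (an illustrative rendering: the symbols $\Lambda_\lambda$, $\phi$, and $\delta$ are this sketch's notation, and the precise hypotheses on the phase are those of the cited papers):

```latex
% General shape of the d-linear oscillatory integral form discussed above
% (illustrative notation; the cutoff \phi, the phase S, and the exact
% nondegeneracy hypothesis follow the cited papers, not this sketch).
\[
  \Lambda_{\lambda}(f_1,\dots,f_d) \;=\; \int_{\mathbb{R}^{d}}
    e^{i\lambda S(x_1,\dots,x_d)}\, \prod_{j=1}^{d} f_j(x_j)\; \phi(x)\, dx,
\]
% and the estimates in question assert power decay in the phase parameter,
\[
  \bigl|\Lambda_{\lambda}(f_1,\dots,f_d)\bigr| \;\lesssim\;
    |\lambda|^{-\delta} \prod_{j=1}^{d} \|f_j\|_{L^{p_j}(\mathbb{R})},
    \qquad p_j \ge 2,
\]
% for some \delta > 0 determined by the phase, up to the logarithmic losses
% mentioned in the abstract.
```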