At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit (14; 11), thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function f_θ (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) with respect to a new kernel: the Neural Tangent Kernel (NTK). This kernel is central to describing the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of the training can then be related to the positive-definiteness of the limiting NTK. We prove the positive-definiteness of the limiting NTK when the data is supported on the sphere and the non-linearity is non-polynomial. We then focus on the setting of least-squares regression and show that in the infinite-width limit, the network function f_θ follows a linear differential equation during training. The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, suggesting a theoretical motivation for early stopping. Finally, we study the NTK numerically, observe its behavior for wide networks, and compare it to the infinite-width limit.
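To make the abstract's central object concrete, the sketch below (not code from the paper) computes the empirical NTK, Θ(x, x′) = ⟨∂f_θ(x)/∂θ, ∂f_θ(x′)/∂θ⟩, for a toy one-hidden-layer ReLU network in NumPy. The architecture, the NTK-style 1/√width scaling, and the sample data on the unit sphere are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: empirical Neural Tangent Kernel of a toy one-hidden-layer
# ReLU network with scalar output, computed from an explicit Jacobian.
import numpy as np

rng = np.random.default_rng(0)

def init_params(d_in, width):
    # NTK-style parameterization: standard Gaussian weights, scaling applied in the forward pass.
    return {"W1": rng.standard_normal((width, d_in)),
            "b1": rng.standard_normal(width),
            "W2": rng.standard_normal(width)}

def jacobian(params, x):
    # Forward pass f(x) = (1/sqrt(m)) * W2 . relu(W1 x / sqrt(d) + b1);
    # returns the gradient of f(x) w.r.t. all parameters, flattened into one vector.
    m, d = params["W1"].shape
    pre = params["W1"] @ x / np.sqrt(d) + params["b1"]
    act = np.maximum(pre, 0.0)
    dact = (pre > 0).astype(float)
    dW2 = act / np.sqrt(m)                      # df/dW2
    db1 = params["W2"] * dact / np.sqrt(m)      # df/db1
    dW1 = np.outer(db1, x) / np.sqrt(d)         # df/dW1
    return np.concatenate([dW1.ravel(), db1, dW2])

def empirical_ntk(params, X):
    J = np.stack([jacobian(params, x) for x in X])   # (n_samples, n_params)
    return J @ J.T                                   # Gram matrix Theta(x_i, x_j)

X = rng.standard_normal((5, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # data on the sphere, as in the paper's assumption
params = init_params(d_in=3, width=4096)
print(np.round(empirical_ntk(params, X), 3))
```

Re-running the snippet with different seeds at large width should show the Gram matrix changing very little, the finite-width counterpart of the convergence to a deterministic limiting kernel described above.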
Hot tearing in castings is closely related to the difficulty of bridging or coalescence of dendrite arms during the last stage of solidification. The details of the process determine the temperature at which a coherent solid forms, i.e., a solid that can sustain tensile stresses. Based on the disjoining-pressure concept used in fluid dynamics, a theoretical framework is established for the coalescence of primary-phase dendritic arms within a single grain or at grain boundaries. For pure substances, approaching planar liquid/solid interfaces coalesce to a grain boundary at an undercooling ΔT_b given by ΔT_b = ΔΓ_b/δ, where δ is the thickness of an isolated solid-liquid interface and ΔΓ_b is the difference between the grain-boundary energy, γ_gb, and twice the solid/liquid interfacial energy, 2γ_sl, divided by the entropy of fusion. If γ_gb < 2γ_sl, then ΔT_b < 0 and the liquid film is unstable: coalescence occurs as soon as the two interfaces get close enough (at a distance on the order of δ). This situation, typical of dendrite arms belonging to the same grain (i.e., γ_gb = 0), is referred to as "attractive". The situation where γ_gb = 2γ_sl is referred to as "neutral", i.e., coalescence occurs at zero undercooling. If γ_gb > 2γ_sl, the two liquid/solid interfaces are "repulsive" and ΔT_b > 0. In this case, a stable liquid film between adjacent dendrite arms located across such grain boundaries can remain until the undercooling exceeds ΔT_b. For alloys, coalescence is also influenced by the concentration of the liquid film: its temperature and concentration must reach a coalescence line parallel to, but ΔT_b below, the liquidus line before coalescence can occur. Using one-dimensional (1-D) interface-tracking calculations, diffusion in the solid phase perpendicular to the interface (back-diffusion) is shown to aid the coalescence process. To study the interaction of interface curvature and diffusion in the liquid film parallel to the interface, a multiphase-field approach has been used. After validating the method against the 1-D interface-tracking results for pure substances and alloys, it is applied to two-dimensional (2-D) situations for binary alloys. The coalescence process is shown to originate in small necks and to involve rapidly changing liquid/solid interface curvatures.
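As a purely illustrative back-of-the-envelope evaluation of the sign convention above: with ΔΓ_b = (γ_gb − 2γ_sl)/Δs_f, a repulsive boundary (γ_gb > 2γ_sl) yields a positive coalescence undercooling. The numerical inputs below are assumed order-of-magnitude values, not data from the paper.

```python
# Illustrative only: evaluating Delta_T_b = (gamma_gb - 2*gamma_sl) / (Delta_s_f * delta)
# with assumed, order-of-magnitude values (not taken from the paper).
gamma_gb = 0.55      # grain-boundary energy, J/m^2 (assumed)
gamma_sl = 0.20      # solid/liquid interfacial energy, J/m^2 (assumed)
delta_s_f = 1.0e6    # entropy of fusion per unit volume, J/(K*m^3) (assumed)
delta = 1.0e-9       # diffuse solid-liquid interface thickness, m (assumed)

dT_b = (gamma_gb - 2 * gamma_sl) / (delta_s_f * delta)
print(f"Delta_T_b = {dT_b:.0f} K")   # > 0: repulsive boundary, a stable liquid film persists
```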
Supervised deep learning involves the training of neural networks with a large number N of parameters. For large enough N, in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as N grows past a certain threshold N*. Instead, empirical studies have shown that in the over-parametrized regime, generalization error keeps decreasing with N. We resolve this paradox through a new framework. We rely on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations f_N − f̄_N ∼ N^{-1/4} of the neural net output function f_N around its expectation f̄_N. These fluctuations affect the generalization error ε_N for classification: under natural assumptions, it decays to a plateau value ε_∞ in a power-law fashion ∼ N^{-1/2}. This description breaks down at a so-called jamming transition N = N*; at this threshold, we argue that ‖f_N‖ diverges. This result leads to a plausible explanation for the cusp in test error known to occur at N*. Our results are confirmed by extensive empirical observations on the MNIST and CIFAR image datasets. Our analysis finally suggests that, given a fixed computational budget, the smallest generalization error is obtained by training several networks of intermediate size, just beyond N*, and averaging their outputs.
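The closing suggestion about averaging several intermediate-size networks relies on independent initialization fluctuations averaging out. The toy sketch below illustrates only that statistical effect; the Gaussian fluctuation model and its scale (a stand-in for the N^{-1/4} term) are assumptions, not a simulation of trained networks.

```python
# Toy illustration: if independent networks' outputs fluctuate around a common
# mean, averaging K of them shrinks the fluctuation by about 1/sqrt(K).
import numpy as np

rng = np.random.default_rng(1)
true_mean = 0.7      # stand-in for the expected output on one input
sigma_N = 0.1        # stand-in for the finite-size fluctuation scale (assumed)

for K in (1, 4, 16, 64):
    ensembles = rng.normal(true_mean, sigma_N, size=(10_000, K)).mean(axis=1)
    print(K, np.std(ensembles - true_mean).round(4))   # decreases roughly as sigma_N / sqrt(K)
```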