Mean-Field Analysis of Two-Layer Neural Networks: Global Optimality with Linear Convergence Rates

Zhang, Jingwei; Huang, Xunpeng; Yu, Jincheng

doi:10.2139/ssrn.4416411

Cited by 1 publication

(3 citation statements)

References 0 publications


“…We parameterize the policy π by a neural network F with parameter $\theta = (\pmb{W}, \pmb{b})$, that is, $a \sim \pi(s,a;\theta) = f(F((s,a);\theta))$ for some function f. A popular choice of f is given by

\begin{equation*} f(F((s,a);\theta)) = \frac{\exp(\tau F((s,a);\theta))}{\sum_{a^\prime \in \mathcal{A}} \exp(\tau F((s,a^\prime);\theta))}, \end{equation*}

for some parameter τ, which gives an energy‐based policy (see, e.g., Haarnoja et al., 2017; Wang et al., 2020). The policy parameter θ is updated using the gradient ascent rule given by

\begin{equation*} \theta^{(n+1)} = \theta^{(n)} + \beta \widehat{\nabla_{\theta} J(\theta^{(n)})}, \qquad n = 0, \ldots, N-1, \end{equation*}

where $\widehat{\nabla_{\theta} J(\theta^{(n)})}$ is an estimate of the policy gradient.…”

Section: Deep Reinforcement Learning (mentioning)

confidence: 99%
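
The softmax (energy-based) policy and the gradient ascent rule quoted above can be sketched in a few lines. This is a minimal illustration, not the cited authors' implementation: it assumes a linear score F((s,a);θ) = W[a]·s + b[a] in place of a general neural network, and it assumes a score-function (REINFORCE-style) average as the gradient estimate, which the quoted passage does not specify. The function names and the (state, action, return) sample format are hypothetical.

```python
import numpy as np

def softmax_policy(W, b, s, tau=1.0):
    """pi(s, .; theta) for theta = (W, b), assuming F((s,a); theta) = W[a] @ s + b[a]."""
    logits = tau * (W @ s + b)
    logits = logits - logits.max()  # shift for numerical stability; softmax is unchanged
    p = np.exp(logits)
    return p / p.sum()

def grad_log_policy(W, b, s, a, tau=1.0):
    """Gradient of log pi(s, a; theta) with respect to (W, b) for the linear F above."""
    p = softmax_policy(W, b, s, tau)
    onehot = np.zeros_like(p)
    onehot[a] = 1.0
    g_b = tau * (onehot - p)   # d log pi / d b
    g_W = np.outer(g_b, s)     # d log pi / d W
    return g_W, g_b

def gradient_ascent_step(W, b, samples, beta=0.1, tau=1.0):
    """One update theta <- theta + beta * grad_estimate.

    `samples` is a list of (state, action, return) triples; the estimate is the
    average of return * grad log pi (an assumed estimator, see the lead-in).
    """
    gW = np.zeros_like(W)
    gb = np.zeros_like(b)
    for s, a, G in samples:
        dW, db = grad_log_policy(W, b, s, a, tau)
        gW += G * dW
        gb += G * db
    n = len(samples)
    return W + beta * gW / n, b + beta * gb / n

# Two actions, three state features, one sample that rewarded action 0.
W = np.zeros((2, 3))
b = np.zeros(2)
s = np.array([1.0, 0.0, 0.0])
W_new, b_new = gradient_ascent_step(W, b, [(s, 0, 1.0)], beta=0.1)
print(softmax_policy(W, b, s), softmax_policy(W_new, b_new, s))
```

In this toy run the probability of the rewarded action rises after the step, which is the qualitative behavior the update rule is meant to produce. A larger τ makes the policy more sharply peaked on high-scoring actions.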

“…Using neural networks to parametrize the policy and/or value functions in the vanilla version of policy‐based methods discussed in Section 2.4 leads to neural Actor–Critic algorithms (Wang et al., 2020), neural PPO/TRPO (Liu et al., 2019), and deep DPG (DDPG) (Lillicrap et al., 2016). In addition, since introducing an entropy term in the objective function encourages policy exploration (Haarnoja et al., 2017) and speeds the learning process (Haarnoja et al., 2018; Mei et al., 2020) (as discussed in Section 2.5.4), there have been some recent developments in (off‐policy) soft Actor–Critic algorithms (Haarnoja et al., 2018), (Haarnoja et al., 2018) using neural networks, which solve the RL problem with entropy regularization.…”

Section: Deep Reinforcement Learning (mentioning)

confidence: 99%

“…(2019) provided a mean‐squared sample complexity for neural PPO and TRPO algorithms with sublinear convergence rate; Wang et al. (2020) studied neural Actor–Critic methods where the actor updates using (1) vanilla policy gradient or (2) natural policy gradient, and in both cases, the critic updates using TD(0). They proved that in case (1), the algorithm converges to a stationary point at a sublinear rate and they also established the global optimality of all stationary points under mild regularity conditions.…”

Section: Deep Reinforcement Learning (mentioning)

confidence: 99%


Recent advances in reinforcement learning in finance

Hambly; Yang (2023), Mathematical Finance


The rapid changes in the finance industry due to the increasing amount of data have revolutionized the techniques on data processing and data analysis and brought new theoretical and computational challenges. In contrast to classical stochastic control theory and other analytical approaches for solving financial decision‐making problems that heavily rely on model assumptions, new developments from reinforcement learning (RL) are able to make full use of the large amount of financial data with fewer model assumptions and to improve decisions in complex financial environments. This survey paper aims to review the recent developments and use of RL approaches in finance. We give an introduction to Markov decision processes, which is the setting for many of the commonly used RL approaches. Various algorithms are then introduced with a focus on value‐ and policy‐based methods that do not require any model assumptions. Connections are made with neural networks to extend the framework to encompass deep RL algorithms. We then discuss in detail the application of these RL algorithms in a variety of decision‐making problems in finance, including optimal execution, portfolio optimization, option pricing and hedging, market making, smart order routing, and robo‐advising. Our survey concludes by pointing out a few possible future directions for research.

