2023
DOI: 10.2139/ssrn.4416411
Preprint

Mean-Field Analysis of Two-Layer Neural Networks: Global Optimality with Linear Convergence Rates

Cited by 1 publication (3 citation statements: 0 supporting, 3 mentioning, 0 contrasting)
References 0 publications
“…We parameterize the policy π by a neural network F with parameter $\theta = (\pmb{W}, \pmb{b})$, that is, $a \sim \pi(s, a; \theta) = f(F((s, a); \theta))$ for some function f. A popular choice of f is given by $$f(F((s,a);\theta)) = \frac{\exp(\tau F((s,a);\theta))}{\sum_{a^\prime \in \mathcal{A}} \exp(\tau F((s,a^\prime);\theta))},$$ for some parameter τ, which gives an energy-based policy (see, e.g., Haarnoja et al., 2017; Wang et al., 2020). The policy parameter θ is updated using the gradient ascent rule $$\theta^{(n+1)} = \theta^{(n)} + \beta\, \widehat{\nabla_{\theta} J(\theta^{(n)})}, \qquad n = 0, \ldots, N-1,$$ where $\widehat{\nabla_{\theta} J(\theta^{(n)})}$ is an estimate of the policy gradient.…”
Section: Deep Reinforcement Learning (mentioning)
confidence: 99%
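The excerpt above describes a softmax (energy-based) policy parameterization together with a gradient-ascent update on θ. As a rough illustration only, here is a minimal numerical sketch of that recipe; the two-layer network F, the REINFORCE-style gradient estimate, the toy reward, and all hyperparameters (τ, β, dimensions) are assumptions for the demo, not the cited papers' implementation.

```python
# Minimal sketch (not the cited implementation): a softmax ("energy-based") policy
# pi(a | s; theta) = exp(tau * F((s, a); theta)) / sum_a' exp(tau * F((s, a'); theta)),
# with F a small two-layer network and theta updated by gradient ascent
# theta <- theta + beta * grad-estimate.  The REINFORCE estimator, the toy
# one-step environment, and all sizes below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
state_dim, n_actions, hidden = 4, 3, 16
tau, beta = 1.0, 0.05                      # temperature and step size

# theta = (W, b): first-layer weights/bias plus a linear read-out per action.
W1 = rng.normal(scale=0.1, size=(hidden, state_dim))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(n_actions, hidden))

def scores(s):
    """F((s, a); theta) for every action a, given state s."""
    h = np.tanh(W1 @ s + b1)
    return W2 @ h, h

def policy(s):
    """Softmax over tau * F((s, a); theta)."""
    f, h = scores(s)
    z = np.exp(tau * (f - f.max()))
    return z / z.sum(), h

def reward(s, a):
    """Toy reward: action 0 is best in every state (assumption for the demo)."""
    return 1.0 if a == 0 else 0.0

for n in range(500):
    s = rng.normal(size=state_dim)
    p, h = policy(s)
    a = rng.choice(n_actions, p=p)
    r = reward(s, a)

    # REINFORCE-style estimate of grad_theta J: r * grad_theta log pi(a | s; theta).
    dlogits = -tau * p
    dlogits[a] += tau                      # d log pi / d F
    gW2 = np.outer(dlogits, h)
    dh = (W2.T @ dlogits) * (1.0 - h ** 2)
    gW1, gb1 = np.outer(dh, s), dh

    # Gradient *ascent* step on the estimated objective.
    W2 += beta * r * gW2
    W1 += beta * r * gW1
    b1 += beta * r * gb1

print("action probabilities on a test state:", policy(rng.normal(size=state_dim))[0])
```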
“…Using neural networks to parametrize the policy and/or value functions in the vanilla version of policy-based methods discussed in Section 2.4 leads to neural Actor–Critic algorithms (Wang et al., 2020), neural PPO/TRPO (Liu et al., 2019), and deep DPG (DDPG) (Lillicrap et al., 2016). In addition, since introducing an entropy term in the objective function encourages policy exploration (Haarnoja et al., 2017) and speeds up the learning process (Haarnoja et al., 2018; Mei et al., 2020), as discussed in Section 2.5.4, there have been some recent developments in (off-policy) soft Actor–Critic algorithms using neural networks (Haarnoja et al., 2018), which solve the RL problem with entropy regularization.…”
Section: Deep Reinforcement Learning (mentioning)
confidence: 99%
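Since this excerpt points to entropy regularization as the mechanism that encourages exploration in soft Actor–Critic style methods, the short sketch below spells out the entropy bonus and its gradient with respect to the action scores F. The coefficient alpha, the temperature tau, and the example numbers are illustrative assumptions, not any of the cited algorithms.

```python
# Minimal sketch of the entropy-regularized objective mentioned above:
# J_ent(theta) = E[ return ] + alpha * E[ H(pi(. | s; theta)) ], where the
# entropy bonus H(pi) = -sum_a pi(a|s) log pi(a|s) encourages exploration.
# alpha, tau, and the example scores are assumptions for this demo.
import numpy as np

tau, alpha = 1.0, 0.01

def softmax_policy(f):
    """pi(a | s) proportional to exp(tau * F((s, a); theta))."""
    z = np.exp(tau * (f - f.max()))
    return z / z.sum()

def entropy_and_grad(f):
    """Entropy of the softmax policy and its gradient w.r.t. the scores F."""
    p = softmax_policy(f)
    logp = np.log(p)
    H = -np.sum(p * logp)
    # d H / d f_k = -tau * p_k * (log p_k + H)
    dH_df = -tau * p * (logp + H)
    return H, dH_df

# In a soft Actor-Critic style update, this entropy gradient, scaled by alpha,
# is added to the policy-gradient term before the gradient-ascent step.
f = np.array([2.0, 0.5, -1.0])             # example scores F((s, a); theta)
H, dH_df = entropy_and_grad(f)
print("entropy:", H, "scaled entropy gradient w.r.t. scores:", alpha * dH_df)
```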