“…Since the dimensionality of θ can be much smaller than |S||A|, the DQN is trained efficiently with few experiences and generalizes to unseen state vectors. Unfortunately, DQN model inaccuracy can propagate into the cost prediction error in (16), which can cause instability in (18) and lead to performance degradation, and even divergence [43], [44].

¹ Softmax is a function that takes as input a vector z ∈ R^F and normalizes it into a probability distribution via σ(z)_i = e^{z_i} / Σ_{j=1}^{F} e^{z_j}.

[Algorithm listing fragment, steps 7–9 and 22–25:]
7: Take action a_n using the local policy:
   a_n(t, τ) = π_n(s(t−1, τ)) if t ≠ 1, and a_n(t, τ) = π_n(s(T, τ−1)) if t = 1
8: Requests r_n(t, τ) are revealed
9: Set s_n(t, τ) = r_n(t, τ)
22: Find ∇L^Tar(θ) for these samples, using (20)
23: Update θ_{τ+1} = θ_τ − β_τ ∇L^Tar(θ_τ)
24: If mod(τ, C) = 0, then update θ^Tar = θ_τ
25: end
…”
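The target-network update in steps 22–25 can be sketched as follows. This is a minimal illustration, not the paper's implementation: it substitutes a linear Q-function Q(s, a) = θ_a · s for the DQN, and the replay samples, step size β, and sync period C are synthetic placeholder values chosen here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
S_DIM, N_ACTIONS = 4, 3
theta = rng.normal(size=(N_ACTIONS, S_DIM)) * 0.1  # online parameters theta_tau
theta_tar = theta.copy()                           # frozen target parameters theta^Tar
gamma, beta, C = 0.9, 0.05, 10                     # discount, step size beta_tau, sync period

def q_values(params, s):
    """Linear Q-values for all actions at state s (stand-in for the DQN)."""
    return params @ s

for tau in range(1, 101):
    # One sampled transition (s, a, cost, s_next) -- synthetic for this sketch.
    s = rng.normal(size=S_DIM)
    a = rng.integers(N_ACTIONS)
    cost = rng.normal()
    s_next = rng.normal(size=S_DIM)

    # TD target uses the *frozen* target parameters (cost-minimization form),
    # which is what keeps the bootstrapped target from chasing theta itself.
    target = cost + gamma * q_values(theta_tar, s_next).min()
    td_error = q_values(theta, s)[a] - target

    # Gradient of the squared TD loss w.r.t. theta[a] is td_error * s (step 23).
    theta[a] -= beta * td_error * s

    # Step 24: sync the target parameters every C iterations.
    if tau % C == 0:
        theta_tar = theta.copy()

print(np.allclose(theta_tar, theta))  # True: a sync just occurred at tau = 100
```

Freezing θ^Tar between syncs is precisely the stabilization mechanism the paragraph alludes to: without it, the bootstrapped target in step 22 moves with every gradient step, which is one source of the instability and divergence discussed in [43], [44].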