Initialize replay memory D with capacity N
Initialize Q-network with random weights θ
Initialize target Q-network with weights θ_target = θ
Initialize offloading environment
for episode = 1 to M do
    Initialize state s
    for t = 1 to T_max do
        Choose action a in state s using the ε-greedy policy
        Execute action a; observe reward r and next state s'
        Store transition (s, a, r, s') in replay memory D
        Sample a random mini-batch of B transitions (s_j, a_j, r_j, s'_j) from D
        Compute the target Q-values:
            if s'_j is terminal then target_j = r_j
            else target_j = r_j + γ · max_{a'} Q_target(s'_j, a'; θ_target)
        Update the Q-network parameters θ by minimizing the loss:
            loss = (1/B) · Σ_j (target_j − Q(s_j, a_j; θ))²
            θ ← θ − α · ∇_θ loss
        Every C steps, update the target Q-network: θ_target ← θ
        if s' is terminal then break, else s ← s'
    end for
    Evaluate performance and monitor convergence at the end of each episode
end for

Here M denotes the total number of episodes, T_max the maximum number of steps per episode, B the mini-batch size (distinct from the replay capacity N), γ the discount factor, α the learning rate, and C the target-network update period.
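To make the procedure concrete, the following is a minimal PyTorch sketch of the training loop above. It assumes a Gym-style offloading environment exposing reset() and step(action) -> (next_state, reward, done); the network architecture, hyperparameter values, and the names env, state_dim, and n_actions are illustrative assumptions, since the pseudocode does not specify them.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Illustrative hyperparameters (not specified in the pseudocode)
N = 10_000      # replay memory capacity
M = 500         # total episodes
T_MAX = 200     # max steps per episode
B = 64          # mini-batch size
GAMMA = 0.99    # discount factor γ
ALPHA = 1e-3    # learning rate α
C = 100         # target-network update period
EPSILON = 0.1   # exploration rate (often annealed in practice)

def make_q_network(state_dim, n_actions):
    # Assumed architecture: a small MLP mapping a state
    # to one Q-value per discrete offloading action.
    return nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )

def train(env, state_dim, n_actions):
    D = deque(maxlen=N)                             # replay memory D with capacity N
    q_net = make_q_network(state_dim, n_actions)    # Q-network with random weights θ
    target_net = make_q_network(state_dim, n_actions)
    target_net.load_state_dict(q_net.state_dict())  # θ_target = θ
    optimizer = torch.optim.SGD(q_net.parameters(), lr=ALPHA)
    step_count = 0

    for episode in range(M):
        s = env.reset()
        for t in range(T_MAX):
            # ε-greedy action selection
            if random.random() < EPSILON:
                a = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    a = q_net(torch.as_tensor(s, dtype=torch.float32)).argmax().item()

            s_next, r, done = env.step(a)
            D.append((s, a, r, s_next, done))       # store transition in D

            if len(D) >= B:
                batch = random.sample(D, B)         # sample random mini-batch
                states, actions, rewards, next_states, dones = zip(*batch)
                states = torch.as_tensor(states, dtype=torch.float32)
                actions = torch.as_tensor(actions)
                rewards = torch.as_tensor(rewards, dtype=torch.float32)
                next_states = torch.as_tensor(next_states, dtype=torch.float32)
                dones = torch.as_tensor(dones, dtype=torch.float32)

                # target_j = r_j if s'_j is terminal,
                # else r_j + γ · max_{a'} Q_target(s'_j, a'; θ_target);
                # the (1 - dones) mask implements the terminal-state branch
                with torch.no_grad():
                    max_next_q = target_net(next_states).max(dim=1).values
                targets = rewards + GAMMA * max_next_q * (1.0 - dones)

                # loss = (1/B) Σ_j (target_j − Q(s_j, a_j; θ))²
                q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
                loss = ((targets - q_values) ** 2).mean()

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()                    # θ ← θ − α · ∇_θ loss

            step_count += 1
            if step_count % C == 0:                 # every C steps: θ_target ← θ
                target_net.load_state_dict(q_net.state_dict())

            if done:
                break
            s = s_next
    return q_net
```

Plain SGD is used here to match the θ ← θ − α · ∇_θ loss update in the pseudocode; implementations in practice frequently substitute Adam and a Huber loss for stability.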