…Furthermore, the states with higher d and larger h are visited less often, as is evident from Figure 2. Because, if h ≥ 800 Mbits and d ≥ 10 min at the …

(1) Initialization
(2) π(a|s) as a random uniform policy
(3) Q(s, a) ← Ω(s, a) + Σ_{a_i} Σ_{s'} π(a_i|s) Q(s', a_i)
(4) β(s, a) ← 0, ∀s ∈ S, a ∈ A
(5) for each download request μ_o(ψ, D) episode do
(6)   define state s(k, h, d): d = D, h = ψ, k randomly generated using (6)
(7)   while download is not complete (h > 0) do
(8)     if Q(s, a) is the same for all a then
(9)       choose action a at random
(10)    else
(11)      choose a = argmin_a Q(s, a)
(12)    end if
(13)    take action a
(14)    update c(s, a) using (13)
(15)    if d > 0 then
(16)      obtain Ω(s, a) using (9)
(17)      obtain …

We have reported the data offloading policy learned by the Q-agent in Figure 3, that is, the optimal actions taken by the Q-agent in the same states as mentioned in Figure 2. To give better insight, we have included data for only the three most important locations, where the decision-making is challenging. …
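The episode loop in the listing can be sketched as a minimal cost-minimising Q-learning agent. This is an illustrative sketch only: the state encoding `(k, h, d)`, the action set, the `step` callback standing in for the environment's cost Ω(s, a), and the learning-rate/discount parameters are all assumptions, not the paper's exact formulation.

```python
import random
from collections import defaultdict

# Hypothetical action labels (the paper's action set is not shown here).
ACTIONS = [0, 1]

def choose_action(Q, s):
    """Lines (8)-(12): if Q(s, a) is the same for all a, pick at random;
    otherwise pick a = argmin_a Q(s, a) (cost minimisation)."""
    values = [Q[(s, a)] for a in ACTIONS]
    if len(set(values)) == 1:           # Q(s, a) identical for every action
        return random.choice(ACTIONS)
    return min(ACTIONS, key=lambda a: Q[(s, a)])

def run_episode(step, h0, d0, k0, alpha=0.1, gamma=0.9):
    """One download-request episode: loop until h (remaining bits) hits 0.

    `step(s, a) -> (cost, s_next)` is a stand-in for the environment:
    it returns the immediate cost Omega(s, a) and the successor state.
    """
    Q = defaultdict(float)              # Q(s, a) initialised to 0 for all s, a
    s = (k0, h0, d0)                    # state s(k, h, d) from line (6)
    while s[1] > 0:                     # while download not complete (h > 0)
        a = choose_action(Q, s)
        cost, s_next = step(s, a)       # take action a, observe Omega(s, a)
        # Standard Q-learning backup toward the minimum-cost successor value:
        target = cost + gamma * min(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q
```

As a usage example, a toy `step` that drains 400 Mbits per action and charges a fixed cost per action lets the episode terminate after two iterations of the while loop.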