“…Thus, routing can be readily modeled as a Markov decision process [7] and naturally fits into the realm of reinforcement learning. In this direction, many previous works [8,9,10,11,12,13,14,15,16] have employed the classical Q-learning [17] algorithm to train agents to find the optimal route. In these works, a distinct agent is associated with each transmission node.…”