“…
1) when the network file is ready to be read, i.e., the DQL core is not writing to the file, the DQL middleware loads the network weights and the value of ε from the network file, and the simulator starts a new episode;
2) the simulator sends a message to the controller containing the current scenario specifications;
3) the DQL middleware processes the message to identify the current state $s_t$ and collects the reward $r_{t-1}$, which refers to the state change from $s_{t-1}$ to $s_t$ caused by the past action $a_{t-1}$; it then composes a new tuple $(s_{t-1}, a_{t-1}, r_{t-1}, s_t)$;
4) if the episode fails, the DQL middleware saves the tuple in $RB_{\mathrm{rare}}$, the simulation ends, and the process restarts from step 1); otherwise, it saves the tuple in $RB_{\mathrm{common}}$;
5) the DQL middleware calls the network with the current state $s_t$ to estimate the rewards of all possible actions and chooses the next action $a_t$ using the ε-greedy policy;
6) the DQL middleware computes the new RCS value $v$ using equation (8) and sends it to the ADS;
7) the process restarts from step 1) until the training is completed.
…”
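Read as pseudocode, steps 1)–7) describe one episode loop built around a shared network file and two replay buffers. The Python sketch below restates that loop under stated assumptions: every interface in it (`load_network_when_ready`, `simulator.next_message`, `network.predict`, `ads.send`, and `compute_rcs` standing in for equation (8), plus the lock-file and pickle conventions) is a hypothetical stand-in for the paper's components, not the authors' implementation.

```python
import os
import pickle
import random
import time
from typing import List, Tuple

# Illustrative stand-ins for the paper's components; every name below is an
# assumption made for this sketch, not the authors' actual code.
State = Tuple[float, ...]
Transition = Tuple[State, int, float, State]  # (s_{t-1}, a_{t-1}, r_{t-1}, s_t)

rb_common: List[Transition] = []  # RB_common: ordinary transitions
rb_rare: List[Transition] = []    # RB_rare: transitions ending a failed episode


def load_network_when_ready(path: str):
    """Step 1): block until the DQL core has finished writing the network
    file, then load the weights and the current ε. The lock-file convention
    and pickle format are assumptions made for this sketch."""
    while os.path.exists(path + ".lock"):  # DQL core still writing
        time.sleep(0.1)
    with open(path, "rb") as f:
        blob = pickle.load(f)
    return blob["network"], blob["epsilon"]


def epsilon_greedy(q_values: List[float], epsilon: float) -> int:
    """Step 5): explore with probability ε, otherwise pick the best action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def run_episode(network, simulator, ads, epsilon: float, compute_rcs) -> None:
    """Steps 2)-6) for a single episode."""
    s_prev, a_prev = None, None
    while True:
        # 2)-3) receive the scenario message; extract s_t and r_{t-1}
        msg = simulator.next_message()
        s_t, r_prev, failed = msg.state, msg.reward, msg.episode_failed
        if a_prev is not None:
            transition = (s_prev, a_prev, r_prev, s_t)
            # 4) a failed episode routes the tuple to RB_rare and ends the
            #    simulation; otherwise the tuple goes to RB_common
            if failed:
                rb_rare.append(transition)
                return
            rb_common.append(transition)
        # 5) estimate the rewards of all actions in s_t, choose a_t ε-greedily
        a_t = epsilon_greedy(network.predict(s_t), epsilon)
        # 6) compute the new RCS value v (equation (8)) and send it to the ADS
        ads.send(compute_rcs(s_t, a_t))
        s_prev, a_prev = s_t, a_t


def training_loop(simulator, ads, network_file: str, compute_rcs) -> None:
    """Steps 1)-7): reload the network before each episode and repeat until
    training completes (the stopping test is a placeholder)."""
    while not simulator.training_completed():
        network, epsilon = load_network_when_ready(network_file)  # step 1)
        run_episode(network, simulator, ads, epsilon, compute_rcs)
    # 7) the loop exits once training is completed
```

One point this restatement makes visible: keeping $RB_{\mathrm{rare}}$ separate from $RB_{\mathrm{common}}$ presumably lets the DQL core oversample failure transitions when it assembles training batches, rather than letting the rare failing episodes be diluted in a single buffer.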