“…At each instant n, the action probability vector p_i(n) is updated by the linear learning algorithm given in equation (13) if the chosen action a_i(n) is rewarded by the environment, and by equation (14) if the chosen action is penalized [104].

[11], [12], [13]
Global problem: [26], [27], [28]
Healthcare: [32], [33], [34], [35], [36], [41], [98], [37], [39], [40], [42], [43], [45], [46], [47], [48], [49], [50], [51]
Industrial: [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62]
Network: [63], [67], [68], [69], [99], [100]
Physics: [71], [72]
Text processing: …”
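The reward and penalty updates described in the excerpt can be illustrated with a short sketch. Equations (13) and (14) are not reproduced here, so this assumes the standard linear reward-penalty form; the function name `linear_update` and the step sizes `a` (reward) and `b` (penalty) are hypothetical choices for illustration, not taken from the source.

```python
def linear_update(p, i, rewarded, a=0.1, b=0.05):
    """Return the updated action probability vector after one step.

    Assumed form (standard linear reward-penalty scheme, not confirmed
    by the excerpt):
      reward:  p_i <- p_i + a * (1 - p_i),  p_j <- (1 - a) * p_j          (j != i)
      penalty: p_i <- (1 - b) * p_i,        p_j <- b / (r - 1) + (1 - b) * p_j
    where r is the number of actions and i is the chosen action.
    """
    r = len(p)
    if r < 2:
        raise ValueError("need at least two actions")
    if not 0 <= i < r:
        raise IndexError("chosen action index out of range")

    if rewarded:
        # Move probability mass toward the chosen action.
        return [pj + a * (1.0 - pj) if j == i else (1.0 - a) * pj
                for j, pj in enumerate(p)]

    # Move probability mass away from the chosen action and spread it
    # evenly over the remaining r - 1 actions.
    return [(1.0 - b) * pj if j == i else b / (r - 1) + (1.0 - b) * pj
            for j, pj in enumerate(p)]


if __name__ == "__main__":
    p = [0.25, 0.25, 0.25, 0.25]
    print(linear_update(p, 0, rewarded=True))
    print(linear_update(p, 0, rewarded=False))
```

Both branches keep the vector normalized, so the result can be fed back in at the next instant. Under the same assumption, setting `b = 0` leaves the vector unchanged on a penalty, which is the linear reward-inaction variant.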