“…In Figure 3, the cost evolution of OPB coincides with that of the optimal policy (middle) because, in this case, the cost of the best arm (arm 4) equals the constraint threshold τ = 0.2. As described in Section 1, our setting is closest to the one studied by Amani et al. [2019] and Moradipari et al. [2019]. They study a slightly different setting, in which the mean cost of the action that the agent takes must satisfy the constraint, i.e., ⟨x_t, µ_*⟩ ≤ τ, rather than the mean cost of the policy it computes, i.e., ⟨x_{π_t}, µ_*⟩ ≤ τ, as in our case.…”
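
The difference between the two constraints can be illustrated with a minimal sketch. It assumes that x_{π_t} denotes the expected action under the policy π_t, so that ⟨x_{π_t}, µ_*⟩ is the π_t-weighted average of the per-arm mean costs. The arm costs and the policy below are hypothetical values chosen for illustration and are not taken from the paper's experiments.

```python
# Hypothetical per-arm mean costs <x, mu_*>; illustrative values only.
mean_costs = [0.1, 0.5, 0.3, 0.2]
tau = 0.2  # constraint threshold

# Hypothetical policy: a probability distribution over the four arms.
policy = [0.8, 0.1, 0.1, 0.0]

EPS = 1e-12  # tolerance for floating-point comparisons


def policy_cost(policy, mean_costs):
    """Expected cost of the policy: sum over arms of pi(x) * <x, mu_*>."""
    return sum(p * c for p, c in zip(policy, mean_costs))


def policy_is_feasible(policy, mean_costs, tau):
    """Policy-level constraint: <x_pi, mu_*> <= tau."""
    return policy_cost(policy, mean_costs) <= tau + EPS


def action_is_feasible(arm, mean_costs, tau):
    """Action-level constraint: <x_t, mu_*> <= tau for the arm actually played."""
    return mean_costs[arm] <= tau + EPS


# Arms that the policy plays with positive probability but that would
# violate the action-level constraint if played.
violating_arms = [
    arm for arm, p in enumerate(policy)
    if p > 0 and not action_is_feasible(arm, mean_costs, tau)
]
```

Under these assumed numbers the policy satisfies the policy-level constraint even though it places positive probability on arms whose individual mean cost exceeds τ. Such a policy would not be admissible under the action-level constraint.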