Recently, policy optimization (PO) has seen great success in many real-world applications, especially when coupled with deep neural networks (Silver et al., 2017; Duan et al., 2016; Wang et al., 2018), and a variety of PO-based algorithms have been proposed (Williams, 1992; Kakade, 2001; Schulman et al., 2015, 2017; Konda and Tsitsiklis, 2000). The theoretical understanding of PO has also been studied from both the computational (i.e., convergence) perspective (Liu et al., 2019; Wang et al., 2019a) and the statistical (i.e., regret) perspective (Efroni et al., 2020a). Thus, one fundamental question is how to build on existing understandings of non-private PO algorithms to design sample-efficient policy-based RL algorithms with general privacy guarantees (e.g., JDP and LDP), which is the main motivation behind this work.