The prevalent paradigm for training federated learning (FL) models [Konečný et al., 2016b,a, McMahan et al., 2017] (see also the recent surveys by Kairouz et al. [2019], Li et al. [2020a]) is to use distributed first-order optimization methods employing one or more tools for enhancing communication efficiency, which is a key bottleneck in the federated setting. These tools include: (i) communication compression [Konečný et al., 2016b, Alistarh et al., 2017, Khirirat et al., 2018] and techniques for progressively reducing the variance introduced by compression [Mishchenko et al., 2019, Horváth et al., 2019, Gorbunov et al., 2020a, Li et al., 2020b, Gorbunov et al., 2021a]; (ii) local computation [McMahan et al., 2017, Stich, 2020, Khaled et al., 2020, Mishchenko et al., 2021a] and techniques for reducing the client drift introduced by local computation [Karimireddy et al., 2020, Gorbunov et al., 2021b]; and (iii) partial participation [McMahan et al., 2017, Gower et al., 2019] and techniques for taming the slow-down introduced by partial participation [Gorbunov et al., 2020a, Chen et al., 2020].
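To make this paradigm concrete, the following is a minimal sketch (ours, not taken from any of the cited works) combining the three tools on a toy quadratic problem: an unbiased rand-k compressor, a few local gradient steps per round, and server-side client sampling. All names and hyperparameters (`rand_k`, `local_steps`, `sampled`, etc.) are illustrative assumptions, not the methods of any specific cited paper.

```python
# Minimal sketch of the prevalent FL paradigm: distributed first-order
# optimization with (i) communication compression, (ii) local computation,
# and (iii) partial participation. Illustrative only; not from the cited works.
import numpy as np

rng = np.random.default_rng(0)

def rand_k(v, k):
    """Unbiased rand-k sparsification: keep k random coordinates, rescale."""
    d = v.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = v[idx] * (d / k)  # rescaling by d/k makes the compressor unbiased
    return out

# Synthetic quadratic losses f_i(x) = 0.5 * ||A_i x - b_i||^2 on n clients.
n, d, k = 10, 20, 5
A = [rng.standard_normal((30, d)) for _ in range(n)]
b = [rng.standard_normal(30) for _ in range(n)]

def grad(i, x):
    return A[i].T @ (A[i] @ x - b[i])

x = np.zeros(d)  # global model held by the server
lr, local_steps, rounds, sampled = 0.005, 5, 300, 4

for r in range(rounds):
    # (iii) Partial participation: the server samples a subset of clients.
    clients = rng.choice(n, size=sampled, replace=False)
    updates = []
    for i in clients:
        y = x.copy()
        # (ii) Local computation: several local gradient steps before communicating.
        for _ in range(local_steps):
            y -= lr * grad(i, y)
        # (i) Communication compression: send a sparsified model update.
        updates.append(rand_k(y - x, k))
    # The server averages the compressed updates and moves the global model.
    x += np.mean(updates, axis=0)

full_grad = np.mean([grad(i, x) for i in range(n)], axis=0)
print("||grad|| after training:", np.linalg.norm(full_grad))
```

Run as-is, this naive combination exhibits exactly the issues the correction techniques above target: the compressor injects variance, local steps cause client drift under heterogeneous data, and client sampling slows progress per round, which is why the cited variance-reduction, drift-correction, and partial-participation methods modify each ingredient.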