“…There are two proposals for the momentum parameter that guarantee convergence, but they are completely opposite. Specifically, one assumes a strictly monotonically decreasing schedule ($\beta_{1,t-1} > \beta_{1,t}$, $\beta_{1,t} \to 0$) (Kingma and Ba 2015; Reddi, Kale, and Kumar 2018; Wang et al. 2020; Zhuang et al. 2020), while the other demands an increasing schedule ($\beta_{1,t-1} < \beta_{1,t}$, $\beta_{1,t} \to 1$) (Ghadimi, Feyzmahdavian, and Johansson 2015; Yang, Lin, and Li 2016; Tao et al. 2021; Li, Liu, and Orabona 2022). This not only creates a theory-practice gap but also causes confusion when selecting hyper-parameters in practice. On the other hand, Ghadimi, Feyzmahdavian, and Johansson (2015) use a constant momentum parameter, which guarantees the convergence of HB but relies on the strong assumptions of strong convexity and smoothness of the objective function.…”
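
To make the two opposing schedule families concrete, the following is a minimal sketch, not taken from any of the cited works. It assumes that HB refers to the standard heavy-ball update $x_{t+1} = x_t - \alpha \nabla f(x_t) + \beta_t (x_t - x_{t-1})$. The two schedules `beta_decreasing` ($\beta_t = 0.9/t$) and `beta_increasing` ($\beta_t = t/(t+2)$) are hypothetical choices that merely satisfy the monotonicity and limit conditions quoted above; the function names, step size, and test objective are likewise illustrative assumptions.

```python
def beta_decreasing(t):
    """Hypothetical schedule: strictly decreasing in t, with beta_t -> 0."""
    return 0.9 / t


def beta_increasing(t):
    """Hypothetical schedule: strictly increasing in t, with beta_t -> 1."""
    return t / (t + 2)


def heavy_ball(grad, x0, lr, beta_schedule, steps):
    """Run heavy-ball iterations with a time-varying momentum parameter.

    Update: x_{t+1} = x_t - lr * grad(x_t) + beta_t * (x_t - x_{t-1}),
    starting from x_0 = x_{-1} = x0 (an assumed initialization).
    """
    x_prev = x0
    x = x0
    for t in range(1, steps + 1):
        beta = beta_schedule(t)
        x_next = x - lr * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


def quadratic_grad(x):
    """Gradient of the illustrative objective f(x) = 0.5 * x**2."""
    return x


if __name__ == "__main__":
    for name, schedule in [("decreasing", beta_decreasing),
                           ("increasing", beta_increasing)]:
        x_final = heavy_ball(quadratic_grad, x0=1.0, lr=0.1,
                             beta_schedule=schedule, steps=200)
        print(f"{name:>10} schedule: x_200 = {x_final:.3e}")
```

Passing the schedule as a function keeps the update rule identical across both families, so any difference in behavior comes from the choice of $\beta_t$ alone. A constant momentum parameter, as in the last sentence of the excerpt, corresponds to `beta_schedule=lambda t: c` for a fixed `c`.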