Baselines: In this work, we perform experiments on the benchmarks above, comparing against the following fixed-capacity methods and one expansion-based method: (1) SGD, which finetunes the model with plain stochastic gradient descent; (2) EWC (Kirkpatrick et al., 2017), one of the pioneering regularization methods, which uses the diagonal of the Fisher information as importance weights (sketched below); (3) A-GEM (Chaudhry et al., 2019a), which uses loss gradients of stored previous data as an inequality constraint on the optimization (sketched below); (4) LOS (Chaudhry et al., 2020), which constrains gradients to a low-rank orthogonal subspace; (5) ER-ring (Chaudhry et al., 2019b), which uses a tiny ring memory to alleviate forgetting; (6) GPM (Saha et al., 2021), which trains new tasks in the residual gradient subspace; (7) APD (Yoon et al., 2019), a strong expansion-based method that decomposes the parameters of different tasks over a shared basis; and (8) STL, which trains a separate model for each task. For the compared methods, we follow the original implementations and perform any method-specific processing required at the end of each task.
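For concreteness, the following is a minimal sketch of the EWC regularizer, assuming a PyTorch-style model; the helper name `ewc_penalty` and the dictionary arguments are illustrative, not taken from the compared implementations. The penalty anchors each parameter to its value after the previous task, weighted by the Fisher diagonal.

```python
import torch

def ewc_penalty(model, fisher_diag, old_params, lam):
    """EWC quadratic penalty (Kirkpatrick et al., 2017).

    fisher_diag: dict mapping parameter names to diagonal Fisher estimates,
                 computed after the previous task (illustrative name).
    old_params:  dict mapping parameter names to the parameter values saved
                 at the end of the previous task.
    lam:         penalty strength (the regularization coefficient).
    """
    loss = 0.0
    for name, p in model.named_parameters():
        if name in fisher_diag:
            # Penalize movement away from the old solution, per-parameter
            # weighted by how important that parameter was (Fisher diagonal).
            loss = loss + (fisher_diag[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * loss

# Usage: total_loss = task_loss + ewc_penalty(model, fisher, old, lam=100.0)
```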
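Similarly, a minimal sketch of the A-GEM update, again assuming a PyTorch-style model and SGD optimizer; `agem_step`, `flat_grad`, and the batch arguments are illustrative names. If the current-task gradient conflicts with the gradient on a sampled memory batch (negative dot product), the conflicting component is projected out so the memory loss is not increased.

```python
import torch

def flat_grad(model, loss):
    """Backprop `loss` and return all parameter gradients as one flat vector."""
    model.zero_grad()
    loss.backward()
    return torch.cat([p.grad.detach().flatten()
                      for p in model.parameters() if p.grad is not None])

def agem_step(model, optimizer, loss_fn, batch, mem_batch):
    """One A-GEM update (Chaudhry et al., 2019a); a sketch, not the
    reference implementation. `batch`/`mem_batch` are (inputs, targets)."""
    x, y = batch
    g = flat_grad(model, loss_fn(model(x), y))        # current-task gradient
    xm, ym = mem_batch
    g_ref = flat_grad(model, loss_fn(model(xm), ym))  # memory gradient
    dot = torch.dot(g, g_ref)
    if dot < 0:  # inequality constraint g . g_ref >= 0 violated
        g = g - (dot / torch.dot(g_ref, g_ref)) * g_ref  # project out conflict
    # Write the (possibly projected) gradient back, then take the step.
    offset = 0
    for p in model.parameters():
        if p.grad is None:
            continue
        n = p.numel()
        p.grad.copy_(g[offset:offset + n].view_as(p))
        offset += n
    optimizer.step()
```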