“…We study the common continual learning paradigm in which pre-training precedes continual learning [28,29,55,10,71,32,33,34,68,4]. Formally, given a pre-training dataset $A = \{(X_i, y_i)\}_{i=1}^{M}$, consisting of $M$ images $X_i$ and their corresponding labels $y_i \in Y$, a set of parameters $\theta$ is learned for a CNN using $A$ in an offline manner, i.e., the learner can shuffle the data to simulate independent and identically distributed samples and loop over them as many times as it desires.…”
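To make the offline setting concrete, the following is a minimal sketch of such a pre-training loop, assuming a PyTorch-style setup; `model`, `pretrain_dataset`, and the hyperparameters are illustrative placeholders, not details taken from the cited works:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def pretrain_offline(model: nn.Module, pretrain_dataset,
                     num_epochs: int = 10, batch_size: int = 128,
                     lr: float = 0.1) -> nn.Module:
    """Offline pre-training on dataset A: reshuffle every epoch to
    approximate i.i.d. sampling, and loop over the data freely."""
    # shuffle=True re-orders (X_i, y_i) pairs each epoch, simulating
    # independent and identically distributed data.
    loader = DataLoader(pretrain_dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(num_epochs):        # unrestricted passes over A
        for images, labels in loader:  # mini-batches of (X_i, y_i)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model  # parameters theta learned from A
```

The unrestricted number of epochs and the per-epoch reshuffle are exactly what distinguish this offline phase from the continual-learning phase that follows, where data arrives as a stream and cannot be revisited at will.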