“…The former assumes that only the labeled videos from the seen categories are available during training while the latter can use the unlabeled data of the unseen categories for model training. Specifically, in this work, we focus on inductive ZSAR [12], [15], [26], [42] and do not discuss the transductive approach [9], [32].…”