“…• We show that multi-modal self-supervision, applied to both source and unlabelled target data, can be used for …

Figure 2: Fine-grained action datasets [8,17,26,28,38,42,46,47,50]. x-axis: number of action segments per environment (ape); y-axis: dataset size divided by ape. EPIC-Kitchens [8] offers the largest ape relative to its size.…”