“…With the advent of deep learning (LeCun et al, 2015; Goodfellow et al, 2016), it became possible to learn visual features characterising the task directly from raw RGB videos. These features are extracted using a variety of methods: deep metric learning (Sermanet et al, 2018), generative adversarial learning (Stadie et al, 2017), domain translation (Liu et al, 2018; Smith et al, 2019; Sharma et al, 2019), transfer learning (Sharma et al, 2018; Sermanet et al, 2017), action primitives (Jia et al, 2020), predictive modelling (Tow et al, 2017), video-to-text translation (Yang et al, 2019), and meta-learning (Yu et al, 2018a; Yu et al, 2018b). A comparison of these methods is given in Table 1, and a detailed study can be found in (Pauly, 2021).…”