FIGURE 9 | Point clouds of the four activity-relevant objects involved in Activity 1 were segmented into multiple regions for finer spatial resolution: (A) pitcher, (B) pitcher lid, (C) spoon, and (D) mug.

“…Second, action recognition models in the literature rely on computer vision-based approaches to analyze 2D videos recorded by an egocentric camera, e.g., (Fathi et al., 2011, 2012; Fathi and Rehg, 2013; Matsuo et al., 2014; Soran et al., 2015; Ma et al., 2016; Li et al., 2018; Furnari and Farinella, 2019; Sudhakaran et al., 2019; Liu et al., 2020). Whether using hand-crafted features (Fathi et al., 2011, 2012; Fathi and Rehg, 2013; Matsuo et al., 2014; Soran et al., 2015; Ma et al., 2016; Furnari and Farinella, 2019) or learning end-to-end models (Li et al., 2018; Sudhakaran et al., 2019; Liu et al., 2020), the computer vision-based approaches to action recognition must also address the challenges of identifying and tracking activity-relevant objects.…”
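To make the identification-and-tracking challenge concrete, the following is a minimal sketch of a per-frame detect-and-localize loop over a 2D video, assuming OpenCV is available. It is not the method of any of the cited works: the HSV color range standing in for an object detector, and the input file name `egocentric_clip.mp4`, are hypothetical placeholders. Real egocentric pipelines must additionally cope with hand occlusion, motion blur, and camera egomotion that such naive color-based tracking cannot handle.

```python
import cv2
import numpy as np

# Hypothetical HSV range standing in for one activity-relevant object
# (e.g., a brightly colored mug); a real system would use a learned detector.
LOWER_HSV = np.array([100, 120, 70])   # assumed lower bound (blue-ish hue)
UPPER_HSV = np.array([130, 255, 255])  # assumed upper bound


def detect_object(frame):
    """Return the bounding box (x, y, w, h) of the largest color blob, or None."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, LOWER_HSV, UPPER_HSV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    return cv2.boundingRect(largest)


def track_video(path):
    """Detect the target in every frame and print its centroid trajectory."""
    cap = cv2.VideoCapture(path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        box = detect_object(frame)
        if box is not None:
            x, y, w, h = box
            print(f"frame {frame_idx}: object at ({x + w // 2}, {y + h // 2})")
        frame_idx += 1
    cap.release()


if __name__ == "__main__":
    track_video("egocentric_clip.mp4")  # hypothetical input clip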