“…Although the analysis of hand-object interactions mostly involves bounding box annotations, a few works have studied hand-object relations using semantic segmentation mask annotations (González-Sosa et al., 2021; Zhang et al., 2022a; Darkhalil et al., 2022; Tokmakov et al., 2023). These works address semantic segmentation of hands and active objects in egocentric images (González-Sosa et al., 2021; Zhang et al., 2022a) or videos (Darkhalil et al., 2022; Tokmakov et al., 2023). Darkhalil et al. (2022) defined and predicted hand-object relations, including cases where the on-hand glove is in contact with an object in the environment.…”
Section: State-of-the-art
“…Due to the massive scale and unconstrained nature of Ego4D, it has proved useful for various tasks including action recognition (Liu et al., 2022a; Lange et al., 2023), action detection (Wang et al., 2023a), visual question answering (Bärmann & Waibel, 2022), active speaker detection (Wang et al., 2023d), natural language localisation, natural language queries (Ramakrishnan et al., 2023), gaze estimation (Lai et al., 2022), persuasion modelling for conversational agents (Lai et al., 2023b), audio-visual object localisation (Huang et al., 2023a), hand-object segmentation (Zhang et al., 2022b) and action anticipation (Ragusa et al., 2023a; Pasca et al., 2023; Mascaró et al., 2023). New tasks have also been introduced thanks to the diversity of Ego4D, e.g.…”
What will the future be? We wonder! In this survey, we explore the gap between current research in egocentric vision and the ever-anticipated future, where wearable computing, with outward-facing cameras and digital overlays, is expected to be integrated in our everyday lives. To understand this gap, the article starts by envisaging the future through character-based stories, showcasing through examples the limitations of current technology. We then provide a mapping between this future and previously defined research tasks. For each task, we survey its seminal works, current state-of-the-art methodologies and available datasets, then reflect on shortcomings that limit its applicability to future research. Note that this survey focuses on software models for egocentric vision, independent of any specific hardware. The paper concludes with recommendations for areas of immediate exploration so as to unlock our path to the future of always-on, personalised and life-enhancing egocentric vision.
“…Hand-object grasp reconstruction also employs contact to refine hand and object pose estimation [5, 15, 20, 52, 54]. In addition, some works [36, 47, 62] detect hands and classify their physical contact state into self-contact, person-person contact, and person-object contact. Although they consider the relationship between hands and other objects in the scene, they detect only a rough bounding box or boundary for the hand, instead of a finer-grained contact area.…”
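To make the coarse, hand-level formulation above concrete, the following is a minimal sketch of a per-hand contact-state head of the kind such detectors attach to pooled hand-box features. It is not the architecture of any of the cited works; the feature size, class set and names (e.g. ContactStateHead) are illustrative assumptions, and an upstream hand detector is presumed to supply the ROI features.

```python
# Minimal sketch (not a specific paper's model): classify each detected hand's
# physical contact state from its pooled ROI feature.
import torch
import torch.nn as nn

CONTACT_STATES = ["no_contact", "self_contact", "person_contact", "object_contact"]

class ContactStateHead(nn.Module):
    def __init__(self, in_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, len(CONTACT_STATES)),
        )

    def forward(self, hand_roi_feats: torch.Tensor) -> torch.Tensor:
        # hand_roi_feats: (num_hands, in_dim), one row per detected hand box.
        return self.mlp(hand_roi_feats)  # (num_hands, num_states) logits

# Usage: predict contact states for three detected hands (random features here).
head = ContactStateHead()
logits = head(torch.randn(3, 256))
states = [CONTACT_STATES[i] for i in logits.argmax(dim=1).tolist()]
```

The point of the sketch is the granularity: the prediction is one label per hand box, with no notion of which pixels are actually touching, which is exactly the limitation the quoted passage raises.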
Humans constantly contact objects to move and perform tasks. Thus, detecting human-object contact is important for building human-centered artificial intelligence. However, there exists no robust method to detect contact between the body and the scene from an image, and no dataset to learn such a detector. We fill this gap with HOT ("Human-Object conTact"), a new dataset of human-object contacts in images. To build HOT, we use two data sources: (1) We use the PROX dataset of 3D human meshes moving in 3D scenes, and automatically annotate 2D image areas for contact via 3D mesh proximity and projection. (2) We use the V-COCO, HAKE and Watch-n-Patch datasets, and ask trained annotators to draw polygons around the 2D image areas where contact takes place. We also annotate the body part involved in each contact. We use our HOT dataset to train a new contact detector, which takes a single color image as input and outputs 2D contact heatmaps as well as the body-part labels that are in contact. This is a new and challenging task that extends current foot-ground or hand-object contact detectors to the full generality of the whole body. The detector uses a part-attention branch to guide contact estimation through the context of the surrounding body parts and scene. We evaluate our detector extensively; quantitative results show that our model outperforms baselines and that all components contribute to better performance. Results on images from an online repository show reasonable detections and generalizability. Our HOT data and model are available for research at https://hot.is.tue.mpg.de.
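The automatic labelling route in source (1) can be pictured with a short sketch: body-mesh vertices that lie within a small distance of the scene mesh are treated as contacting, and are then projected into the image to mark 2D contact pixels. This is an illustrative reconstruction under assumed conventions (camera-frame vertices, a fixed distance threshold, simple pinhole projection), not the authors' released annotation code.

```python
# Sketch of contact labelling by 3D mesh proximity and projection, as described
# for the PROX-based portion of HOT. Threshold, KD-tree lookup and pinhole
# projection are illustrative assumptions.
import numpy as np
from scipy.spatial import cKDTree

def contact_mask(body_verts: np.ndarray,   # (N, 3) human mesh vertices, camera frame
                 scene_verts: np.ndarray,  # (M, 3) scene mesh vertices, camera frame
                 K: np.ndarray,            # (3, 3) pinhole camera intrinsics
                 img_hw: tuple,
                 thresh: float = 0.02) -> np.ndarray:
    """Return a binary (H, W) mask of pixels where the body touches the scene."""
    # 1) Proximity: a body vertex is "in contact" if a scene vertex lies within
    #    `thresh` metres of it.
    dists, _ = cKDTree(scene_verts).query(body_verts)
    contact_verts = body_verts[dists < thresh]

    # 2) Projection: map contacting 3D vertices to pixel coordinates.
    h, w = img_hw
    mask = np.zeros((h, w), dtype=bool)
    if len(contact_verts) == 0:
        return mask
    uvw = (K @ contact_verts.T).T                      # (C, 3) homogeneous image coords
    uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    mask[uv[valid, 1], uv[valid, 0]] = True
    return mask
```

In practice the projected points would still need to be grouped or dilated into coherent contact areas, and each area associated with a body-part label; the sketch keeps only the per-pixel splat that the proximity-and-projection idea implies.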