2023
DOI: 10.48550/arxiv.2301.02184
Preprint
Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations

Abstract: Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way? We seek to answer this question by proposing a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation. Our hypothesis is that as multiple people ("egos") move in a scene and talk among themselves, they receive rich audio-visual cues that can help un…

Cited by 1 publication (1 citation statement)
References 58 publications (97 reference statements)
“…Audio-visual spatial correspondence learning Learning the spatial alignment between video and audio is important for self-supervision [77,50,75,66], spatial audio generation [51,21,63,7,45], audio-visual embodied learning [8,44,46,9] and 3D scene mapping [62,47]. However, these methods are either restricted to exocentric settings [51,77,50,21,63,66,7], or else tackle egocentric settings [46,45,9,47] in simulated 3D environments that lack realism and diversity, both in terms of the audio-visual content of the videos and the continuous camera motion due to the camera-wearer's physical movements. On the contrary, we learn an audio-visual representation from real-world egocentric video.…”
Section: Related Work
confidence: 99%