“…While we are inspired by these advances, they also have certain limitations. Often systems will emit sounds (e.g., a frequency sweep) into the environment to ping for spatial information [1,14,15,24,28,44,59,69], which is intrusive if done around people. Furthermore, existing audio-visual models assume that the camera is always on, grabbing new frames, which is wasteful if not intractable, particularly on lightweight, low-power computing devices in AR settings.…”