The goal in user interfaces is natural interactivity unencumbered by sensor and display technology. In this paper, we propose that a multi-modal approach using inverse modeling techniques from computer vision, speech recognition, and acoustics can result in such interfaces. In particular, we demonstrate a system for audiovisual tracking, showing that such a system is more robust, more accurate, more compact, and yields more information than using a single modality for tracking. We also demonstrate how such a system can be used to find the talker among a group of individuals, and render 3D scenes to the user.