“…Some works use visual input to enhance sounds [30,18], fill in missing sounds [42], or generate sounds entirely from video [32,43]. Recent work has also integrated vision and sound to improve recognition of environmental properties [3,21,8] and of object properties such as geometry and materials [40,39]. Lastly, audiovisual data has been used for representation learning [33,4,28].…”