“…In the latter, the sounds are taken from internet video and thus contain a much wider range of auditory events than we consider in this work. Later work simultaneously learned audio and visual representations [26,27,28,29,30,31]. Other work has studied cross-modal distillation [32], sound source localization [33,34,35,27,36,37,38,39,40], active speaker detection [41,42,43], and source separation [44,45,46,47].…”