“…In the video domain, a common multi-modal paradigm involves combining representations from both visual and audio features [4,7,21,32,33,36,48]. Such representations have attracted the interest of the computer vision community, as they allow exploring new approaches to well-established problems, such as person reidentification [32,24,54], audio-visual synchronization [1,8,9], speaker diarization [43,47,58], biometrics [33,39], and audio-visual source separation [4,21,36,40,48]. Active speaker detection is a special instance of audio-visual source separation, where the sources are the visible persons in a video, and the goal is to detect and assign a segment of speech to one of those candidates.…”