“…While recent years have shown great successes in speaker recognition [11,19,42,44], these successes have been reliant on the collection of large, labelled datasets such as VoxCeleb [12,35,36] and others [16,30]. The VoxCeleb datasets, while valuable, have been collected entirely from interviews of celebrities in YouTube videos and are limited in terms of linguistic content (celebrities mostly speak about their professions [33]), emotion, and background noise. In contrast, movies contain speech covering emotions such as anger, sadness, assertiveness, and fright, and varied background conditions: imagine the shouting in a violent scene from an action movie, or a romantic scene of reconciliation in a romcom.…”