“…Several AVSR architectures have been proposed [4,10,17,13,16,22,23], showing that the improvement over ASR models grows as the noise level increases, i.e., as the signal-to-noise ratio (SNR) decreases. The same VSR architectures can also be used to improve the performance of audio-based models in a variety of applications, such as speech enhancement [24], speech separation [25,26], voice activity detection [27], active speaker detection [28], and speaker diarisation [29].…”