“…Motivated by the bimodal nature of human speech perception [2,10] and the invariance of visual information to acoustic signal corruption, audio-visual speech recognition (AVSR) technologies [11,12,13,14] can also be applied to both overlapped speech separation [15,16,17,18,19,20,21,22] and the back-end recognition component. However, the use of the visual modality in the recognition stage of system development for overlapped speech remains limited to date.…”