“…The main contributions of this paper are summarized below: First, to the best of our knowledge, this paper presents the first use of a complete audio-visual multi-channel speech separation, dereverberation and recognition system architecture that fully incorporates visual information into all three stages. In contrast, prior research incorporates the video modality in only the speech enhancement front-end [25,26,28], only the recognition back-end [30,31,32,33], or both the multi-channel speech separation and recognition stages [35,36] while excluding the dereverberation component. Second, a more complete experimental validation of the advantage of audio-visual over audio-only dereverberation approaches of multiple forms (DNN-WPE, spectral mapping) is presented, as previous research [37] considered only the spectral mapping method.…”