“…An improved trade-off between the speech enhancement front-end loss function and ASR accuracy can then be obtained, for example, using multitask learning [67], [78], [79]. To date, such joint optimization of the speech enhancement front-end and the ASR back-end has been conducted only for: a) audio-only speech enhancement and recognition systems that use no video input [19], [23], [72], [78], [80]-[82]; or b) audio-visual speech separation and recognition tasks that do not consider speech dereverberation [67], [79]. Hence, there is a pressing need to derive suitable joint optimization methods for a complete audio-visual multi-channel speech separation, dereverberation and recognition system.…”