“…However, robust large-vocabulary speech recognisers require broad contextual windows spanning multiple words in order to exploit the language models that improve decoding accuracy [20]. This introduces a significant time lag that is prohibitive for real-time speech animation, where tolerable delays are of the order of 200 ms [21], [22]. Conversely, audio-driven speech animation has been shown to generate realistic lip synchronisation in real time, but in a speaker-dependent setting [2], [22].…”