“…However, learning repre-* authors contributed equally sentations for these spoken utterances is a complex research problem due to the presence of multiple heterogeneous sources of information (Baltrušaitis et al, 2017). This challenging yet crucial research area has real-world applications in robotics (Montalvo et al, 2017;Noda et al, 2014), dialogue systems (Johnston et al, 2002;Rudnicky, 2005), intelligent tutoring systems (Mao and Li, 2012;Banda and Robinson, 2011;Pham and Wang, 2018), and healthcare diagnosis (Wentzel and van der Geest, 2016;Lisetti et al, 2003;Sonntag, 2017). Recent progress on multimodal representation learning has investigated various neural models that utilize one or more of attention, memory and recurrent components (Yang et al, 2017;.…”