“…Joint training of Intent Recognition and Entity Extraction models has been explored recently (Zhang and Wang, 2016; Liu and Lane, 2016; Goo et al, 2018; Varghese et al, 2020). Several hierarchical multi-task architectures have been proposed for these joint NLU approaches (Zhou et al, 2016; Wen et al, 2018; Okur et al, 2019; Vanzo et al, 2019), few of them in a multimodal context (Gu et al, 2017; Okur et al, 2020). Vaswani et al (2017) proposed the Transformer as a novel neural network architecture based entirely on attention mechanisms (Bahdanau et al, 2015).…”