…If the F1 overlap is less than 0.5 we drop the example, leaving 281,658 of the original 808,731 examples. For NQ, three different settings are used: with all documents as input, with only the gold document, and with a sampled dialogue history context, fol-

Question Answering
  MS MARCO (Nguyen et al., 2016)
  SQuAD (Rajpurkar et al., 2016)
  TriviaQA (Joshi et al., 2017)
  Natural Questions (Kwiatkowski et al., 2019)
  Natural Questions (Open)
  Natural Questions (Open Dialogues) (Adolphs et al., 2021)

Knowledge-Grounded Dialogue
  Wizard of the Internet (Komeili et al., 2022)
  Wizard of Wikipedia (Dinan et al., 2019b)
  Funpedia (Dinan et al., 2020b)

Open-Domain Dialogue
  PersonaChat (Zhang et al., 2018)
  Empathetic Dialogues (Rashkin et al., 2019)
  Blended Skill Talk (Smith et al., 2020)
  Multi-Session Chat (Xu et al., 2022a)
  LIGHT + WILD (Urbanek et al., 2019; Shuster et al., 2021b)

Recovery & Feedback
  SaFeRDialogues (Ung et al., 2022)
  FITS (Xu et al., 2022b)

Task-Oriented Dialogue
  Google SGD (Rastogi et al., 2020)
  Taskmaster (Byrne et al., 2019)
  Taskmaster 2 (Byrne et al., 2019)
  Taskmaster 3 (Byrne et al., 2019)

Table 2: Details of all the training datasets used for fine-tuning the modular tasks.
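The F1-overlap filter described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes "F1 overlap" means word-level F1 between two text fields of an example (the excerpt does not specify which fields are compared), and the `token_f1` and `filter_examples` names are our own.

```python
from collections import Counter

def token_f1(a: str, b: str) -> float:
    """Word-level F1 overlap between two strings (assumed definition)."""
    a_toks, b_toks = a.lower().split(), b.lower().split()
    if not a_toks or not b_toks:
        return 0.0
    # Multiset intersection counts each shared word at most min(count_a, count_b) times.
    overlap = sum((Counter(a_toks) & Counter(b_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a_toks), overlap / len(b_toks)
    return 2 * precision * recall / (precision + recall)

def filter_examples(pairs, threshold=0.5):
    """Drop pairs whose F1 overlap falls below the threshold, as in the filtering step above."""
    return [(a, b) for a, b in pairs if token_f1(a, b) >= threshold]
```

For example, `token_f1("the cat sat", "the cat ran")` is 2/3 (two of three words shared), so that pair survives the 0.5 threshold, while a pair with no shared words is dropped.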