“…Shared test collections are pervasive in well-known evaluation campaigns, such as TREC (Voorhees & Harman, 2005) or NTCIR (Kando, Sakai, & Sanderson, 2016). Furthermore, research teams sometimes need to build their own testbeds, for instance, to evaluate retrieval algorithms in specific domains (Balog & Neumayer, 2013; Losada & Crestani, 2016). However, creating an IR test collection is expensive and time-consuming.…”