Traditional "ad hoc" test collections, typically built based on depth-100 pools, are often used a posteriori by non-contributors, i.e., research groups that did not contribute the pools. The Leave One Out (LOO) test is useful for testing whether the test collections are actually reusable: that is, whether the non-contributors can be evaluated fairly relative to the contributors' official performances. In contrast, at the recent web search result diversification tasks of TREC and NTCIR, diversity test collections have been built using shallow pools: the pool depths lie between 20 and 40. Thus it is unlikely that these diversity test collections are reusable: in fact, the organisers of these diversity tasks never claimed that they are. Nevertheless, these collections are also used a posteriori by non-contributors. In light of this, Sakai et al. [21] demonstrated by means of LOO tests that the NTCIR-9 INTENT-1 Chinese diversity test collection is not reusable, and also showed that condensed-list evaluation metrics generally provide better estimates of the noncontributors' true performances than raw evaluation metrics. This paper generalises and strengthens their findings through LOO tests with the latest TREC 2012 diversity test collection.