We read with some interest the recent papers in Statistics in Medicine, and associated commentaries, on Net Reclassification Improvement (NRI) and Integrated Discrimination Improvement (IDI). 1,2 It first struck us as somewhat incongruous to devote so much journal space to measures that have clearly been shown to have a host of undesirable properties, such as being grossly anticonservative and favoring overfit models. 3,4 The new version of the NRI, the NRI(p), is said to have good properties because, at a threshold equal to the event rate, it is equivalent to net benefit. So why not just use net benefit? Pencina et al state that in order to use net benefit, there need to be "well-established thresholds," and in their "vast and varied experience," this is "rare." 5 If the authors are implying that there is often no single threshold, this is true, but irrelevant, because net benefit is traditionally estimated across a range of thresholds. If the authors are implying that the suitable range of thresholds is unknown, this is obviously false. It would suggest, for example, that for most models, users have no idea how to interpret the resulting predicted probabilities. We note, for example, that in the paper that introduced the NRI and IDI, Pencina et al themselves used thresholds of 6% and 20% for cardiovascular risk. 6 That paper made no reference to the motivating example being one of the "rare" situations in which well-established thresholds are available. NRI(p) is proposed as a summary measure, but net benefit at a threshold chosen with respect to clinical consequences would be preferable. As Kerr et al show, 7 it is trivial to come up with examples in which NRI(p) inappropriately selects between two markers. Take, for instance, a potentially fatal and highly prevalent disease for which there is a safe and inexpensive drug therapy. A highly specific marker would have superior NRI(p) (i.e., net benefit at the event rate, which is high), but in practice, we would prefer a sensitive marker because there is a premium on finding disease. This scenario would be avoided if a clinically relevant threshold were used in place of the event rate.

The requirement of good calibration is also highly problematic ("need to ascertain good calibration before examining the predictive performance of a new model" 1 ). What level of calibration counts as "good"? The Hosmer-Lemeshow test is not going to help, because it may give a lower p value to a very large study demonstrating a small amount of miscalibration than to a small study with findings of important miscalibration.

So we have a statistical approach with some obvious drawbacks, but which might be useful in the space (size undefined) where there is no information on relevant thresholds (no examples given) and models have to be well calibrated (no criteria given). We challenge the proponents of IDI and NRI to provide nontrivial examples where an NRI or IDI statistic provides useful information over and above net benefit and standard metrics such as the concordance index. Continued promotion of and ac...
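For concreteness, the threshold point can be illustrated with a minimal numerical sketch. The prevalence, sensitivities, specificities, and the 5% "clinical" threshold below are invented purely for illustration and are not taken from any of the cited papers; the net benefit formula is the standard one, TP/n − (FP/n) × t/(1 − t) at threshold t.

```python
# Hypothetical sketch: net benefit for two invented markers, comparing a
# threshold set at the event rate (as NRI(p) implies) with a lower,
# clinically motivated threshold. All numbers are illustrative assumptions.

def net_benefit(tp, fp, n, threshold):
    """Net benefit per patient: (TP - FP * odds(threshold)) / n."""
    odds = threshold / (1.0 - threshold)
    return (tp - fp * odds) / n

n = 1000
prevalence = 0.30             # highly prevalent disease (assumed)
diseased = int(n * prevalence)
healthy = n - diseased

# Hypothetical marker A: highly specific, modest sensitivity
tp_a = int(diseased * 0.50)   # sensitivity 0.50
fp_a = int(healthy * 0.01)    # specificity 0.99

# Hypothetical marker B: highly sensitive, modest specificity
tp_b = int(diseased * 0.95)   # sensitivity 0.95
fp_b = int(healthy * 0.60)    # specificity 0.40

for label, t in [("event rate", prevalence), ("clinical threshold", 0.05)]:
    nb_a = net_benefit(tp_a, fp_a, n, t)
    nb_b = net_benefit(tp_b, fp_b, n, t)
    print(f"{label} (t={t:.2f}): specific marker {nb_a:.3f}, "
          f"sensitive marker {nb_b:.3f}")

# With these numbers, the specific marker has higher net benefit at
# t = 0.30 (the event rate, hence higher NRI(p)), whereas at the low
# threshold implied by a safe, inexpensive therapy the sensitive marker
# is clearly preferred.
```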