Overall, both the use of the DDSM as data source for evaluation of mammography CAD systems, and the application of statistical evaluation methods were found highly diverse. Results reported from different studies are therefore hardly comparable. Drawbacks of the DDSM (e.g. varying quality of lesion annotations) may contribute to the reasons. But larger bias seems to be caused by authors' own decisions upon study design. RECOMMENDATIONS/CONCLUSION: For future evaluation studies, we derive a set of 13 recommendations concerning the construction and usage of a test database, as well as the application of statistical evaluation methods.