Over the course of 19 months, West Virginia University collected reports from 70 footwear experts, each performing 12 questioned-test comparisons, resulting in a dataset that includes more than 1000 examiner attributes (education, training, certification status, etc.), 3500 impression features identified and evaluated (clarity, totality, and similarity), and 840 source conclusions. The results were used to estimate the performance of forensic footwear examiners in the United States, including error rates, predictive value (PV), and measures of interrater reliability (IRR). For the dataset and mate prevalence (31.5%) used in this study, the correct predictive value ranges from 94.5% for exclusions and 85.0% for identifications down to 70.1% and 65.2% for limited associations and associations of class, respectively (with all other conclusions producing PVs between these extremes). After data transformation based on ground truth, the case study materials show a false-positive rate of 0.48%, a false-negative rate of 15.6%, a (correct) positive predictive value of 98.8%, and a (correct) negative predictive value of 93.3%. In addition to error rates and PVs, interrater reliability was computed to describe examiner reproducibility; results indicate Gwet's AC2 agreement coefficients of 0.751 and 0.692 when using six- and four-level reporting structures, respectively, which translate into "substantial" and "moderate" agreement on a benchmarked verbal-equivalent scale. The reported performance metrics are further compared against past forensic footwear reliability studies, including a discussion of how the use of a six-level reporting structure impacts results.
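The abstract's predictive values follow from the reported error rates and mate prevalence via the standard predictive-value formulas. The sketch below (not the authors' code; the study works from raw counts, so small rounding differences are possible) shows how the reported PPV and NPV are consistent with the stated false-positive rate, false-negative rate, and 31.5% mate prevalence.

```python
# Sketch: standard predictive-value formulas applied to the rates reported in the abstract.
prevalence = 0.315   # proportion of mate (same-source) comparisons in the dataset
fpr = 0.0048         # false-positive rate (erroneous identifications among non-mates)
fnr = 0.156          # false-negative rate (erroneous exclusions among mates)

tpr = 1.0 - fnr      # true-positive rate (correct identifications among mates)
tnr = 1.0 - fpr      # true-negative rate (correct exclusions among non-mates)

# Positive predictive value: P(mate | identification)
ppv = (prevalence * tpr) / (prevalence * tpr + (1.0 - prevalence) * fpr)

# Negative predictive value: P(non-mate | exclusion)
npv = ((1.0 - prevalence) * tnr) / ((1.0 - prevalence) * tnr + prevalence * fnr)

print(f"PPV = {ppv:.1%}")   # ~98.8%, matching the reported value
print(f"NPV = {npv:.1%}")   # ~93.3%, matching the reported value
```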