Background: Electronic Health Records (EHRs) offer a wealth of observational data. Machine-learning (ML) methods can extract data efficiently and can process the information-rich free-text physician notes in EHRs. The clinical diagnosis recorded in these notes represents physician expert opinion and is more consistently recorded than the components of classification criteria.
Objectives: To investigate the overlap and differences between rheumatoid arthritis (RA) patients identified either from EHR free text, through ML-based extraction of the rheumatologist's diagnosis, or through manual chart review applying the 1987 and 2010 RA classification criteria.
Methods: Since EHR initiation, 17,662 patients have visited the Leiden outpatient clinic. For ML, we used a Support Vector Machine (SVM) model to identify those diagnosed with RA by their rheumatologist. We trained and validated the model on a random selection of 2,000 patients, balancing positive predictive value (PPV) and sensitivity to define a cutoff, and assessed performance on a separate set of 1,000 patients. We then deployed the model on the entire patient selection (including these 3,000). Of those, 1,212 patients had both a 1987 and a 2010 EULAR/ACR criteria status at one year after inclusion in the local prospective arthritis cohort. In these 1,212 patients we compared the characteristics of RA cases identified with ML and of those fulfilling the classification criteria.
Results: The ML model performed very well in the independent test set (sensitivity = 0.85, specificity = 0.99, PPV = 0.86, NPV = 0.99). Among the patients with both EHR and classification information, 406 were recognized as RA by ML, and 386 and 457 fulfilled the 1987 and 2010 criteria, respectively. Eighty percent of the ML-identified cases fulfilled at least one of the criteria sets. Neither demographic nor clinical parameters differed between the ML-extracted cases and those identified with the EULAR/ACR classification criteria.
Conclusions: ML methods enable fast patient extraction from the large EHR resource. Our ML algorithm accurately identifies patients diagnosed with RA by their rheumatologist. The resulting group of RA patients overlapped strongly with the patients identified using the 1987 or 2010 classification criteria, and the baseline (disease) characteristics were comparable. ML-assisted case labelling enables high-throughput creation of inclusive patient selections for research purposes.
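The Methods mention choosing a score cutoff by balancing PPV and sensitivity, and the Results report sensitivity, specificity, PPV and NPV. The abstract does not say how that balance was defined, so the following is only a minimal sketch under one possible reading: pick the cutoff at which PPV and sensitivity are closest to each other. The function names and the toy scores and labels are hypothetical illustrations, not the authors' code or data.

```python
from typing import Dict, Optional, Sequence, Tuple


def confusion_counts(
    scores: Sequence[float], labels: Sequence[int], cutoff: float
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) when score >= cutoff is called a case."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_case = score >= cutoff
        if predicted_case and label:
            tp += 1
        elif predicted_case:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(numerator: int, denominator: int) -> float:
    # An empty denominator leaves the metric undefined.
    return numerator / denominator if denominator else float("nan")


def metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, PPV and NPV from confusion counts."""
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


def choose_cutoff(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Assumed balancing rule: minimise |PPV - sensitivity| over observed scores."""
    best_cutoff: Optional[float] = None
    best_gap = float("inf")
    for cutoff in sorted(set(scores)):
        m = metrics(*confusion_counts(scores, labels, cutoff))
        gap = abs(m["ppv"] - m["sensitivity"])
        # Comparisons with NaN are False, so undefined metrics are skipped.
        if gap < best_gap:
            best_cutoff, best_gap = cutoff, gap
    if best_cutoff is None:
        raise ValueError("no cutoff with defined PPV and sensitivity")
    return best_cutoff


# Hypothetical validation scores (e.g. classifier outputs) and true labels.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
toy_cutoff = choose_cutoff(toy_scores, toy_labels)
toy_metrics = metrics(*confusion_counts(toy_scores, toy_labels, toy_cutoff))
```

In a setup like the one described, the cutoff would be chosen on the training and validation patients only, and the four metrics would then be computed once on the held-out test patients with that cutoff fixed.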