Background/Purpose: There is an urgent need to identify effective biomarkers for early diagnosis of rheumatoid arthritis (RA) and accurate monitoring of disease activity. Here we define an RA meta-profile using publicly available cross-tissue gene expression data and apply machine learning to identify putative biomarkers, which we further validate on independent datasets.
Methods: We carried out a comprehensive search of the NCBI Gene Expression Omnibus database for publicly available microarray gene expression data from whole blood and synovial tissues of RA patients and healthy controls. The raw data from 13 synovium datasets with 284 samples and 14 blood datasets with 1,885 samples were downloaded and processed. The datasets for each tissue were merged, batch corrected, and split into training and test sets. We then developed and applied a robust feature selection pipeline to identify genes dysregulated in both tissues and highly associated with RA. From the training data we identified a set of overlapping differentially expressed genes that satisfied the condition of co-directionality, that is, genes changed in the same direction in both tissues. The classification performance of each gene in the resulting set was evaluated on the test sets using the area under the receiver operating characteristic curve (AUROC). Five independent datasets were used to validate and threshold the feature-selected (FS) genes. Finally, we define the RAScore, composed of a geometric mean of the selected RAScore Panel genes, and demonstrate its clinical utility.
Results: The feature selection pipeline yielded a set of 25 upregulated and 28 downregulated genes. To assess the robustness of these FS genes, we trained a Random Forest machine learning model on this set of 53 genes and, separately, on the set of 32 common differentially expressed (DE) genes, and tested both on the validation cohorts. The model with FS genes outperformed the model with common DE genes (AUC 0.89 ± 0.04 vs 0.86 ± 0.05).
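The co-directionality filter and the per-gene AUROC evaluation described in the Methods can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the use of log fold changes as input, and the toy gene labels (GENE_A to GENE_D) are assumptions made for illustration, and each input dictionary is assumed to already contain only the genes called differentially expressed in that tissue.

```python
from typing import Dict, List, Sequence, Tuple


def codirectional_genes(
    synovium_lfc: Dict[str, float], blood_lfc: Dict[str, float]
) -> Tuple[List[str], List[str]]:
    """Return (up, down) genes that are DE in both tissues with the same sign.

    Assumption: each dict maps a DE gene to its log fold change (RA vs control).
    """
    up: List[str] = []
    down: List[str] = []
    # Only genes present in both tissues can be co-directional.
    for gene in sorted(set(synovium_lfc) & set(blood_lfc)):
        s, b = synovium_lfc[gene], blood_lfc[gene]
        if s > 0 and b > 0:
            up.append(gene)
        elif s < 0 and b < 0:
            down.append(gene)
        # Genes with opposite signs (or a zero change) are discarded.
    return up, down


def auroc(values: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC of a single gene: the fraction of (case, control) pairs in which
    the case has the higher value, counting ties as one half."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > h else 0.5 if c == h else 0.0
        for c in cases
        for h in controls
    )
    return wins / (len(cases) * len(controls))


# Toy, made-up log fold changes for hypothetical genes.
synovium = {"GENE_A": 1.2, "GENE_B": -0.8, "GENE_C": 0.5}
blood = {"GENE_A": 0.9, "GENE_B": -0.4, "GENE_C": -0.3, "GENE_D": 1.0}
up_genes, down_genes = codirectional_genes(synovium, blood)
print(up_genes, down_genes)
```

For a downregulated gene this AUROC falls below 0.5, so one option is to report `1 - auroc(...)` for genes in the downregulated set. Whether the original pipeline does this is not stated in the abstract.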
The FS genes were further thresholded on the 5 independent datasets, resulting in 10 upregulated genes (TNFAIP6, S100A8, TNFSF10, DRAM1, LY96, QPCT, KYNU, ENTPD1, CLIC1, ATP6V0E1) that are involved in innate immune system pathways, including neutrophil degranulation and apoptosis, and are expressed in granulocytes, dendritic cells, and macrophages; and 3 downregulated genes (HSP90AB1, NCL, CIRBP) that are involved in metabolic processes and T-cell receptor regulation of apoptosis and are expressed in lymphoblasts.
To investigate the clinical utility of the 13 validated genes, the RAScore was developed. It was significantly correlated with DAS28 (r = 0.33 ± 0.03, p = 7e-9) and able to distinguish osteoarthritis (OA) from RA samples (OR 0.57, 95% CI [0.34, 0.80], p = 8e-10). Moreover, the RAScore did not differ significantly between rheumatoid factor (RF)-positive and RF-negative RA sub-phenotypes (p = 0.9), suggesting that the score is generalizable in clinical applications. The RAScore was also able to monitor the treatment effect among RA patients (t-test of treated vs untreated, p = 2e-4) and to distinguish polyarticular juvenile idiopathic arthritis (polyJIA) from healthy individuals in 10 independent pediatric cohorts (OR 1.15, 95% CI [1.01, 1.3], p = 2e-4).
Conclusion: The RAScore, consisting of 13 putative biomarkers identified through a robust feature selection procedure on public data and validated using multiple independent datasets, may be useful in the diagnosis and treatment monitoring of RA.
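As a rough illustration of how a geometric-mean score over the 13 panel genes could be computed for one sample, consider the sketch below. The abstract states only that the RAScore is composed of a geometric mean of the panel genes. The specific form used here, the geometric mean of the upregulated genes minus the geometric mean of the downregulated genes, is an assumption, as are the function name `ra_score` and the requirement that expression values are positive (for example, log-scale microarray intensities).

```python
from statistics import geometric_mean
from typing import Dict

# Panel genes as listed in the Results.
UP_GENES = [
    "TNFAIP6", "S100A8", "TNFSF10", "DRAM1", "LY96",
    "QPCT", "KYNU", "ENTPD1", "CLIC1", "ATP6V0E1",
]
DOWN_GENES = ["HSP90AB1", "NCL", "CIRBP"]


def ra_score(expression: Dict[str, float]) -> float:
    """Score one sample from a gene -> expression mapping.

    Assumed form (not confirmed by the abstract):
        geometric mean of upregulated genes
        minus geometric mean of downregulated genes.
    All 13 panel genes must be present and have positive values.
    """
    missing = [g for g in UP_GENES + DOWN_GENES if g not in expression]
    if missing:
        raise KeyError(f"missing panel genes: {missing}")
    up = geometric_mean([expression[g] for g in UP_GENES])
    down = geometric_mean([expression[g] for g in DOWN_GENES])
    return up - down


# Toy sample with made-up expression values, for illustration only.
sample = {g: 8.0 for g in UP_GENES}
sample.update({g: 2.0 for g in DOWN_GENES})
print(ra_score(sample))
```

Under this assumed form, a sample with higher expression of the upregulated genes and lower expression of the downregulated genes receives a higher score, which matches the direction of dysregulation reported for RA.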