The complex, evolving landscape of cancer mutations makes it challenging to identify cancer genes among the large lists of mutations typically generated in NGS experiments. The ability to prioritize these variants is therefore of paramount importance. To address this issue we developed OncoScore, a text-mining tool that ranks genes according to their association with cancer, based on the available biomedical literature. Receiver operating characteristic (ROC) curve and area under the curve (AUC) metrics on manually curated datasets confirmed the excellent discriminating capability of OncoScore (OncoScore cut-off threshold = 21.09; AUC = 90.3%, 95% CI: 88.1-92.5%), indicating that OncoScore provides useful results in cases where an efficient prioritization of cancer-associated genes is needed.
The huge amount of data emerging from NGS projects is revolutionizing molecular medicine, leading to the discovery of a large number of new somatic alterations associated with the onset and/or progression of cancer. However, researchers face a formidable challenge in prioritizing cancer genes among the variants generated by NGS experiments. Despite the development of a significant number of tools devoted to cancer driver prediction, limited effort has been dedicated to tools able to generate a gene-centered Oncogenic Score based on the evidence already available in the scientific literature. To overcome this limitation, we propose here OncoScore, a bioinformatics text-mining tool capable of automatically scanning the biomedical literature by means of dynamically updatable web queries and measuring gene-specific cancer association in terms of gene citations. The output of this analysis is a score representing the strength of the association of any gene symbol with cancer, based on the literature available at the time of the analysis.
OncoScore is distributed as an R Bioconductor package (https://bioconductor.org/packages/release/bioc/html/OncoScore.html), to allow full customization of the algorithm and easy integration into existing NGS pipelines, and as a web tool for easy access by researchers with limited or no experience in bioinformatics (http://www.galseq.com/oncoscore.html).
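The kind of evaluation summarized above, an AUC together with a score cut-off, can be illustrated with a minimal Python sketch. It is not the OncoScore implementation, which is an R package. The score lists are hypothetical toy values, the function names are invented for this illustration, and the rule that a gene is called cancer-associated when its score is strictly above the cut-off is an assumption. Only the cut-off value 21.09 is taken from the text.

```python
from typing import Sequence, Tuple

CUTOFF = 21.09  # OncoScore cut-off threshold reported in the text


def auc_pairwise(pos: Sequence[float], neg: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive scores higher; ties count as half (Mann-Whitney form)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(
    pos: Sequence[float], neg: Sequence[float], cutoff: float = CUTOFF
) -> Tuple[float, float]:
    """Assumed rule: score > cutoff means 'cancer-associated'."""
    true_pos = sum(s > cutoff for s in pos)
    true_neg = sum(s <= cutoff for s in neg)
    return true_pos / len(pos), true_neg / len(neg)


# Hypothetical toy scores, NOT values from the paper.
cancer_scores = [85.2, 61.0, 47.3, 30.5, 18.9]
non_cancer_scores = [2.1, 9.8, 14.0, 19.5, 25.7]

auc = auc_pairwise(cancer_scores, non_cancer_scores)
sens, spec = sensitivity_specificity(cancer_scores, non_cancer_scores)
print(f"AUC = {auc:.2f}, sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

On these toy values, 23 of the 25 cancer/non-cancer pairs are ordered correctly, giving an AUC of 0.92. The pairwise form is quadratic in the number of genes, which is acceptable for lists of a few hundred genes; a rank-based computation would be preferable for much larger sets.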
Results
We analyzed the performance of OncoScore on the Cancer Gene Census (CGC; Supplementary Table 1), a regularly updated, manually annotated collection of genes accepted as causally implicated in oncogenesis 1. To assess the ability of OncoScore to discriminate between cancer and non-cancer genes, we computed the OncoScore for the whole CGC dataset and for a manually curated list of genes not associated with cancer (named nCan; Supplementary Table 2; see Methods section for further details). Genes with a total citation count < 10 publications were filtered out; of the initial 507 CGC and 302 nCan genes, 472 (93.1%) and 266 (88.1%), respectively, were processed further.
OncoScore performance. The distribution of OncoScore values differed significantly between the two groups (mean: 48.8 and 14.8 for C...