2019
DOI: 10.1186/s13059-019-1634-2

NCBoost classifies pathogenic non-coding variants in Mendelian diseases through supervised learning on purifying selection signals in humans

Abstract: State-of-the-art methods assessing pathogenic non-coding variants have mostly been characterized on common disease-associated polymorphisms, yet with modest accuracy and strong positional biases. In this study, we curated 737 high-confidence pathogenic non-coding variants associated with monogenic Mendelian diseases. In addition to interspecies conservation, a comprehensive set of recent and ongoing purifying selection signals in humans is explored, accounting for lineage-specific regulatory elements. Supervis…

Citations: cited by 57 publications (80 citation statements)
References: 71 publications (123 reference statements)
“…To optimize classification performance, we selected XGBoost parameter settings to minimize overfitting, as in ref. 58.…”
Section: Classification of Disease-associated or Fine-mapped SNPs (citation type: mentioning)
confidence: 99%
“…We note three key differences between AnnotBoost and previous approaches that utilized gradient boosting to identify pathogenic missense 7 and non-coding variants 9,10 . First, AnnotBoost uses a pathogenicity score as the only input and does not use disease data (e.g.…”
Section: Discussion (citation type: mentioning)
confidence: 98%
“…Second, AnnotBoost produces genome-wide scores, even when some SNPs are unscored by the input pathogenicity score. Third, AnnotBoost leverages 75 diverse features from the baseline-LD model 26,27 , significantly more than previous approaches 7,9,10 . Indeed, we determined that AnnotBoost produces strong signals even when conditioned on those approaches.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…As recently pointed out in [28], pathogenic scores predicted by several state-of-the-art methods are biased towards some specific regulatory region types. Indeed also with Mendelian and GWAS data the positive set of variants is located in different functional non-coding regions (like 5'UTR, 3'UTR or Promoter) and is not evenly distributed over them.…”
Section: Assessment Of the Effect On Prediction Performance Of The Va… (citation type: mentioning)
confidence: 97%
“…The first one (GWAVA) applied a modified random forest [26], where its decision trees are trained on artificially balanced data, thus reducing the imbalance of the data [27]. A second one (NCBoost) used gradient tree boosting learning machines with partially balanced data, achieving very competitive results in the prioritization of pathogenic Mendelian variants, even if the comparison with the other state-of-the-art methods have been performed without retraining them, but using only their pre-computed scores [28]. The unbalancing issue has been fully addressed by ReMM [29] and hyperSMURF [30], through the application of subsampling techniques to the "negative" neutral variants, and oversampling algorithms to the set of "positive" pathogenic variants.…”
Section: Introduction (citation type: mentioning)
confidence: 99%