2019
DOI: 10.1093/nar/gkz774

regBase: whole genome base-wise aggregation and functional prediction for human non-coding regulatory variants

Abstract: Predicting the functional or pathogenic regulatory variants in the human non-coding genome facilitates the interpretation of disease causation. While numerous prediction methods are available, their performance is inconsistent or restricted to specific tasks, which raises the demand of developing comprehensive integration for those methods. Here, we compile whole genome base-wise aggregations, regBase, that incorporate largest prediction scores. Building on different assumptions of causality, we train three co…

Cited by 52 publications (70 citation statements)
References 76 publications
“…Ensemble learning methods generally surpass unsupervised methods when high-quality training data of appropriate type and quantity are available (37). Consistent with recent study on prediction of non-coding regulatory variants (46), ensemble learning methods including XGBoost, RF and GBT exhibit better performance than conventional SVM classifier in all training datasets. Moreover, the model trained by XGBoost algorithm shows the best prediction performance.…”
Section: Discussion (supporting)
confidence: 84%
“…Each DT comprises a series of rules that semi-optimally split the training data. Its sparsity-aware split search approach makes it suitable for our dataset where missing values commonly appear (46). Successive trees that ‘correct’ the errors in the initial tree were learned to improve the classification of positive and negative training examples.…”
Section: Methods (mentioning)
confidence: 99%
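
The statement above mentions two ideas: a split search that copes with missing values, and successive trees that correct earlier errors. A minimal sketch of both follows, written as a toy and not as XGBoost's actual implementation (which uses gradient and Hessian statistics with a regularized gain). It assumes a single feature, squared-error loss, and one-split trees; the names `fit_stump`, `predict_stump`, `boost` and `predict` are hypothetical, and the data are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump on one feature; None marks a missing value."""
    present = sorted({x for x in xs if x is not None})
    best = None
    for lo, hi in zip(present, present[1:]):
        threshold = (lo + hi) / 2
        # Sparsity-aware idea: try sending missing values left, then right,
        # and keep whichever default direction gives the lower error.
        for missing_left in (True, False):
            left, right = [], []
            for x, r in zip(xs, residuals):
                go_left = missing_left if x is None else x < threshold
                (left if go_left else right).append(r)
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, threshold, missing_left, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct observed values")
    return best[1:]  # (threshold, missing_left, left_value, right_value)


def predict_stump(stump, x):
    threshold, missing_left, left_value, right_value = stump
    go_left = missing_left if x is None else x < threshold
    return left_value if go_left else right_value


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Each new stump is fitted to the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Toy data (illustrative only): the rows with a missing feature are positives.
xs = [0.1, 0.2, None, 0.8, 0.9, None]
ys = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
model = boost(xs, ys)
```

In this toy data the missing rows resemble the high-valued rows, so the learned default direction sends missing values to the right-hand branch instead of requiring imputation beforehand.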
“…To determine which of the CFTR introns could be involved in gene regulation, we used the GWAS3D score proposed by Li et al [38]. This score is based on different functional information of regulation, such as Dnase-seq, TF ChIP-Seq, histone modifications and 5C data [39, 40], and a higher score is predictive of a more important functional impact of the variant. We compared the distribution of the scores of variants in the 1000 Genomes project [41] between all CFTR introns with the hypothesis that an intron showing more variants with high GWAS3D scores has a stronger functional effect.…”
Section: Results (mentioning)
confidence: 99%
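
The statement above describes comparing score distributions between introns but does not say which statistical test was used. As one possible illustration only, the sketch below computes a Mann-Whitney U statistic and the corresponding probability that a score drawn from one group exceeds a score drawn from another. The function names and the example scores are hypothetical and are not taken from the cited study.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: each pair with a > b counts 1, each tie counts 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def probability_of_superiority(group_a, group_b):
    """Fraction of cross-group pairs in which group_a has the higher score."""
    return mann_whitney_u(group_a, group_b) / (len(group_a) * len(group_b))


# Made-up scores for variants in one intron versus all other introns.
intron_scores = [3, 4, 5]
other_scores = [1, 2, 3]
effect = probability_of_superiority(intron_scores, other_scores)
```

A value near 0.5 would indicate similar score distributions, while values near 1 would indicate that the first group tends to score higher. A significance test would additionally need a null distribution for U, which this sketch omits.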
“…To investigate this question, we used the GWAS3D score [38] to identify introns enriched in important functional variants in the CFTR gene regulation in the general population. This score is based on different functional information such as DNA-seq, TF ChIP-Seq, histone modifications and 5C data [40], and enables us to rank variants according to their predicted functional impact. We identify four introns, introns 26, 24, 1 and 12, in order of importance (Figure 1) showing GWAS3D scores significantly higher than the other introns.…”
Section: Discussion (mentioning)
confidence: 99%
“…Nevertheless, a new model with a novel experimental approach, CRISPRi-FlowFISH, has been proposed for interpreting the functions of variants in non-coding regions [98]. In addition, various in silico prediction tools for non-coding regions are being developed, including regBase [99], RegSNPs-intron [100], and GRAM [101]. Overall, a better understanding of variants in coding and non-coding regions and single variant annotation across whole genome would take advantage of population-based sequencing data to provide great benefits to human health.…”
Section: Future Perspectives (mentioning)
confidence: 99%