2019
DOI: 10.1093/bioinformatics/btz952

VEF: a variant filtering tool based on ensemble methods

Abstract: Motivation: Variants identified by current genomic analysis pipelines contain many incorrectly called variants. These can be potentially eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF). However, these methods are very user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on decision tree ensemble methods that overcomes the main drawbacks of VQSR and HF. Contrary to these me…

Cited by 6 publications (9 citation statements)
References 16 publications
“…Secondly, EMVC-2 uses a Decision Tree Classifier (DTC) (Pedregosa et al 2011) to filter the untrue SNV candidates identified in the first step. A DTC is chosen as models based on DTs have been shown to discriminate well between true and false called variants in similar settings (Zhang and Ochoa 2020, Cooke et al 2021). Moreover, we refrain from using more complex models such as neural networks due to overfitting concerns.…”
Section: Methods and Experimental Results (mentioning)
confidence: 99%
“…They use features related to the genomic position in which the call was made, such as GC content and proximity to homopolymers as well as quality metrics associated with variant calling. Other studies that use quality metrics extracted from VCF files include (Zhang and Ochoa 2020) and (Holt et al 2021), which use methods such as Gradient Boosting, EasyEnsemble, Random Forest, and AdaBoost. Both studies use Genome in a Bottle Consortium (GIAB) truth sets and train separate models for insertion/deletion variants (Indels) and Single Nucleotide Variations (SNVs).…”
Section: Introduction (mentioning)
confidence: 99%
“…Current state-of-the-art filtering methods include Frequency [25], Hard-Filter [20], VQSR [26], GARFIELD [27], VEF [28], ForestQC [29] and so on, which employ different strategies in addressing the filtering task. The Frequency model defines variant calls with the variant allelic frequency (VAF) less than 20% or the allelic depth (AD) less than 5 as false variants.…”
Section: Introduction (mentioning)
confidence: 99%
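
The Frequency rule quoted in the statement above (VAF below 20% or AD below 5 marks a call as false) can be illustrated with a minimal Python sketch. The names `VariantCall` and `is_false_variant` are hypothetical, and the sketch assumes that AD means the number of reads supporting the alternate allele and that VAF is the alternate read count divided by the sum of reference and alternate reads; the citing paper may define these quantities differently.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical container for the read counts of one variant call."""
    ref_depth: int  # reads supporting the reference allele
    alt_depth: int  # reads supporting the alternate allele (assumed meaning of AD)


def is_false_variant(call: VariantCall, min_vaf: float = 0.20, min_ad: int = 5) -> bool:
    """Apply the quoted thresholds: VAF < 20% or AD < 5 marks the call as false.

    Assumption: VAF = alt_depth / (ref_depth + alt_depth).
    """
    total = call.ref_depth + call.alt_depth
    # A call with no supporting reads gets VAF 0 and is therefore filtered.
    vaf = call.alt_depth / total if total > 0 else 0.0
    return vaf < min_vaf or call.alt_depth < min_ad


if __name__ == "__main__":
    calls = [
        VariantCall(ref_depth=90, alt_depth=10),  # low VAF
        VariantCall(ref_depth=10, alt_depth=10),  # passes both thresholds
        VariantCall(ref_depth=2, alt_depth=4),    # high VAF but too few alt reads
    ]
    for call in calls:
        print(call, "false variant" if is_false_variant(call) else "kept")
```

Both thresholds are exposed as parameters because the quoted values (20% and 5) are the only ones the statement gives; other settings would be a user choice.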
“…ForestQC cannot be utilized on single-sample sequencing data. Five of these filtering methods (Frequency, VQSR, Hard-Filter, GARFIELD, and VEF) are available for quality control of variants from single-sample sequencing data and showed high performance in F1-major and accuracy [28]. However, these state-of-the-art methods were unsatisfactory when measured with the Matthews correlation coefficient (MCC) metric [27], which is a highly suggested measurement for imbalanced data [34], i.e., the WGS variant calls.…”
Section: Introduction (mentioning)
confidence: 99%
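
The contrast between accuracy and MCC on imbalanced data, mentioned in the statement above, can be seen in a small Python sketch. It uses the standard confusion-matrix definitions of both metrics; the counts are invented for illustration and are not taken from any of the cited studies, and returning 0 when the MCC denominator is zero is a common convention assumed here.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all calls that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Illustrative, made-up data: 950 true variants and 50 false ones.
    # A "filter" that keeps every call gets all true variants right (tp=950)
    # and every false variant wrong (fp=50), with no negatives predicted.
    keep_all = dict(tp=950, tn=0, fp=50, fn=0)
    print("accuracy:", accuracy(**keep_all))
    print("MCC:", mcc(**keep_all))
```

In this made-up case the keep-everything filter scores high on accuracy simply because true variants dominate, while its MCC is zero, which is the kind of gap the quoted statement points to.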