A Local Outlier Factor-Based Detection of Copy Number Variations From NGS Data

Yuan, Xiguo; Li, Junping; Bai, Jun; Xi, Jianing

doi:10.1109/tcbb.2019.2961886

Cited by 25 publications

(49 citation statements)

References 57 publications

“…To obtain a complete and reasonable RC profile, we discard the "N" areas in the reference template. This is similar to our previous work (Yuan et al, 2019a). However, this will inevitably lead to a situation: the observed mean RC across the whole genome is smaller than the expected coverage depth since a set of sequencing reads that originated from "N" areas are unmapped.…”

Section: Data Input and Preprocess (supporting)

confidence: 93%

“…A read count (RC) profile can be obtained from this BAM file by using the SAMtools software. The template of the RC profile is the reference genome, where a large majority of the positions have been determined with the regular bases ("A," "T," "C," and "G"), while a small fraction of them have not been determined (Yuan et al, 2019a). The undetermined positions are usually filled with letter "N" so that no sequencing reads could be matched to these areas.…”

Section: Data Input and Preprocess (mentioning)

confidence: 99%
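
The masking step described in the two statements above can be illustrated with a minimal sketch. It assumes read start positions have already been extracted from the BAM file (the SAMtools step is not reproduced here), and the function name `rc_profile`, the fixed `bin_size`, and the rule of dropping any bin that contains an "N" are hypothetical choices for illustration, not the citing authors' implementation.

```python
def rc_profile(reference, read_starts, bin_size):
    """Count reads per fixed-size bin, skipping bins that contain 'N'.

    reference   : reference sequence as a string (assumed input)
    read_starts : 0-based start positions of mapped reads (assumed input)
    bin_size    : bin width in bases (hypothetical parameter)
    Returns a list of (bin_start, read_count) for the retained bins.
    """
    n_bins = len(reference) // bin_size
    counts = [0] * n_bins
    for start in read_starts:
        b = start // bin_size
        if 0 <= b < n_bins:
            counts[b] += 1

    profile = []
    for b in range(n_bins):
        window = reference[b * bin_size:(b + 1) * bin_size]
        # Undetermined bases are written as "N"; no read can map there,
        # so the whole bin is left out of the profile.
        if "N" in window.upper():
            continue
        profile.append((b * bin_size, counts[b]))
    return profile


profile = rc_profile("ACGTACGTNNNNACGT", [0, 1, 5, 13], bin_size=4)
```

Here the third bin (positions 8 to 11) consists of "N" and is omitted, which is why the retained profile has three entries rather than four.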

“…Currently, a lot of computational methods have already been proposed to detect CNVs on targeted, whole-exome, or whole-genome sequencing data. These methods could be generally classified into four categories: read depth (RD), paired-end mapping, split-read, and de novo assembly (Zhao et al, 2013; Mason-Suares et al, 2016; Yuan et al, 2019a). Since the size of CNVs is typically ranging from 1 kb to several mega bases (Freeman et al, 2006) while the length of the sequencing reads is usually limited to hundreds of bases, the RD-based methods are expected to have the most potential to accurately detect CNVs in a wide range of sizes (Yuan et al, 2019a).…”

Section: Introduction (mentioning)

confidence: 99%

“…These methods could be generally classified into four categories: read depth (RD), paired-end mapping, split-read, and de novo assembly (Zhao et al, 2013; Mason-Suares et al, 2016; Yuan et al, 2019a). Since the size of CNVs is typically ranging from 1 kb to several mega bases (Freeman et al, 2006) while the length of the sequencing reads is usually limited to hundreds of bases, the RD-based methods are expected to have the most potential to accurately detect CNVs in a wide range of sizes (Yuan et al, 2019a). One of the most popular RD-based methods is CNVnator (Abyzov et al, 2011), which adopts a mean-shift technique (Comaniciu and Meer, 2002) to partition the observed RD profile into segments with presumably different copy numbers, merges segments with minimal difference in RD by a greedy algorithm, and then makes CNV calls via a t-test procedure.…”

Section: Introduction (mentioning)

confidence: 99%

MFCNV: A New Method to Detect Copy Number Variations From Next-Generation Sequencing Data

Zhao, Huang, et al. 2020

Front. Genet.

Self Cite

Copy number variation (CNV) is a very important phenomenon in tumor genomes and plays a significant role in tumor genesis. Accurate detection of CNVs has become a routine and necessary procedure for a deep investigation of tumor cells and diagnosis of tumor patients. Next-generation sequencing (NGS) technique has provided a wealth of data for the detection of CNVs at base-pair resolution. However, such task is usually influenced by a number of factors, including GC-content bias, sequencing errors, and correlations among adjacent positions within CNVs. Although many existing methods have dealt with some of these artifacts by designing their own strategies, there is still a lack of comprehensive consideration of all the factors. In this paper, we propose a new method, MFCNV, for an accurate detection of CNVs from NGS data. Compared with existing methods, the characteristics of the proposed method include the following: (1) it makes a full consideration of the intrinsic correlations among adjacent positions in the genome to be analyzed, (2) it calculates read depth, GC-content bias, base quality, and correlation value for each genome bin and combines them as multiple features for the evaluation of genome bins, and (3) it addresses the joint effect among the factors via training a neural network algorithm for the prediction of CNVs. We test the performance of the MFCNV method by using simulation and real sequencing data and make comparisons with several peer methods. The results demonstrate that our method is superior to other methods in terms of sensitivity, precision, and F1-score and can detect many CNVs that other methods have not discovered. MFCNV is expected to be a complementary tool in the analysis of mutations in tumor genomes and can be extended to be applied to the analysis of single-cell sequencing data.

“…Considering the importance of genomic positions, we combine smoothed RD signals and their corresponding genomic positions to transform the smoothed RD signals in one-dimensional space into a two-dimensional profile. The details of this transformation are described in our previous study [28]. In this way, we can observe RD signals from both horizontal and vertical levels, which reflect copy number amplitude and positional space, respectively.…”

Section: A Bias Correction and Segmentation (mentioning)

confidence: 98%
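
As a rough illustration of the idea in the statement above, the sketch below pairs each smoothed RD value with its bin index to form two-dimensional points and then scores them with a textbook local outlier factor (LOF). The exact transformation is described in the study cited as [28] and is not reproduced here; using the raw bin index as the position coordinate, applying no rescaling, and the names `to_2d` and `lof_scores` are all assumptions made for this example.

```python
import math


def to_2d(rd_signal):
    """Pair each smoothed RD value with its bin index: (position, RD).

    Using the unscaled bin index as the position axis is an assumption.
    """
    return [(float(i), float(rd)) for i, rd in enumerate(rd_signal)]


def lof_scores(points, k):
    """Textbook LOF using exactly k nearest neighbours per point.

    Assumes no two points coincide (true here, since positions differ),
    so reachability distances are never zero.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    # k nearest neighbours of each point and the distance to the k-th one.
    neighbours, k_dist = [], []
    for i in range(n):
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        neighbours.append([j for _, j in ranked])
        k_dist.append(ranked[-1][0])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [
            max(k_dist[j], math.dist(points[i], points[j]))
            for j in neighbours[i]
        ]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)
    ]


rd = [10, 10, 10, 10, 30, 10, 10, 10, 10, 10]
scores = lof_scores(to_2d(rd), k=3)
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
```

In this toy profile the bin with the elevated RD receives by far the largest score, while the flat bins score close to 1. Because position and RD share one distance metric, the relative scaling of the two axes changes which bins look isolated, so any real use would need the scaling chosen deliberately.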

dpGMM: A Dirichlet Process Gaussian Mixture Model for Copy Number Variation Detection in Low-Coverage Whole-Genome Sequencing Data

Zhang, Yuan, et al. 2020

IEEE Access

Self Cite

Comprehensive identification and cataloging of copy number variation (CNVs) are essential to providing a complete view of human genetic variation and to finding diseased genes. Due to the large-scale sequencing and cost control whole-genome sequencing (WGS) data, low-coverage data is favorably disposed towards CNV identification. However, such low-coverage data is sensitive to noise and sequencing biases, which results in low resolution of CNV detection in past experimental designs for WGS datasets. In this paper, we present a control-free Dirichlet process Gaussian mixture model (dpGMM) based approach, to analyze the read depth (RD) of low-coverage WGS datasets for CNV discovery. First, noise and biases of the RD signals are corrected through the preprocessing step of dpGMM. Then we assume that RD signals across genomic regions follow a Gaussian mixture model (GMM) in which each Gaussian distribution is followed by a copy number state. Without requiring the number of Gaussian distributions, dpGMM builds a Dirichlet process (DP) GMM for RD signals and further uses a DP prior to infer the number of Gaussian models. After that, we apply dpGMM to simulation datasets with different coverages and individual datasets, and compare ours to three widely used RD-based pipelines, CNVnator, GROM-RD, and BIC-seq2. Simulation results demonstrate that our approach, dpGMM, has a high F1 score in both low- and high-coverage sequences. Also, the number of overlaps between CNVs detected in real data by ours and the standard benchmark is twice as much as that detected by other tools such as CNVnator and GROM-RD.

Index terms: Copy number variation, Dirichlet process, Gaussian mixture model, read depth, low coverage.

Critical evaluation of CNA estimators for DNA data using matching confidence masks and WGS technology

Muñoz‐Minjares, Shmaliy, Popova 2021

Biomedical Signal Processing and Control
