“…To obtain a complete and reasonable RC profile, we discard the "N" areas in the reference template. This is similar to our previous work (Yuan et al, 2019a). However, this will inevitably lead to a situation: the observed mean RC across the whole genome is smaller than the expected coverage depth since a set of sequencing reads that originated from "N" areas are unmapped.…”
Section: Data Input and Preprocess (supporting)
confidence: 93%
“…A read count (RC) profile can be obtained from this BAM file by using the SAMtools software. The template of the RC profile is the reference genome, where a large majority of the positions have been determined with the regular bases ("A," "T," "C," and "G"), while a small fraction of them have not been determined (Yuan et al, 2019a). The undetermined positions are usually filled with letter "N" so that no sequencing reads could be matched to these areas.…”
Section: Data Input and Preprocess (mentioning)
confidence: 99%
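The two statements above describe building an RC profile on a reference template and discarding its "N" areas. A minimal Python sketch of that idea follows. It is not the cited authors' implementation: the function name `rc_profile`, the fixed bin size, and the toy data are assumptions made for illustration, and the per-position depths are assumed to have been extracted already (for example with `samtools depth`).

```python
def rc_profile(reference, depth, bin_size):
    """Return (bin_start, mean_depth) pairs, ignoring positions where the
    reference base is 'N'. Bins made up entirely of 'N' are dropped.

    `reference` is a string of bases and `depth` a list of per-position
    read depths of the same length (both hypothetical inputs)."""
    if len(reference) != len(depth):
        raise ValueError("reference and depth must have the same length")
    profile = []
    for start in range(0, len(reference), bin_size):
        bases = reference[start:start + bin_size]
        values = depth[start:start + bin_size]
        # Keep only depths at determined (non-N) positions.
        kept = [d for base, d in zip(bases, values) if base.upper() != "N"]
        if kept:
            profile.append((start, sum(kept) / len(kept)))
    return profile


# Toy example: a 12-base template with a 4-base "N" gap that attracts no reads.
reference = "ACGTNNNNACGT"
depth = [10, 12, 11, 9, 0, 0, 0, 0, 8, 10, 12, 10]

profile = rc_profile(reference, depth, bin_size=4)

# Mean over every position (gap included) versus mean over determined positions.
mean_all_positions = sum(depth) / len(depth)
determined = [d for b, d in zip(reference, depth) if b != "N"]
mean_determined = sum(determined) / len(determined)
```

In this toy case the all-"N" bin disappears from the profile, and the mean taken over every position is lower than the mean over determined positions, which is one simple way to see why zero-coverage "N" areas distort a genome-wide average.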
“…Currently, a lot of computational methods have already been proposed to detect CNVs on targeted, whole-exome, or whole-genome sequencing data. These methods could be generally classified into four categories: read depth (RD), paired-end mapping, split-read, and de novo assembly (Zhao et al, 2013; Mason-Suares et al, 2016; Yuan et al, 2019a). Since the size of CNVs is typically ranging from 1 kb to several mega bases (Freeman et al, 2006) while the length of the sequencing reads is usually limited to hundreds of bases, the RD-based methods are expected to have the most potential to accurately detect CNVs in a wide range of sizes (Yuan et al, 2019a).…”
Section: Introduction (mentioning)
confidence: 99%
“…These methods could be generally classified into four categories: read depth (RD), paired-end mapping, split-read, and de novo assembly (Zhao et al, 2013; Mason-Suares et al, 2016; Yuan et al, 2019a). Since the size of CNVs is typically ranging from 1 kb to several mega bases (Freeman et al, 2006) while the length of the sequencing reads is usually limited to hundreds of bases, the RD-based methods are expected to have the most potential to accurately detect CNVs in a wide range of sizes (Yuan et al, 2019a). One of the most popular RD-based methods is CNVnator (Abyzov et al, 2011), which adopts a mean-shift technique (Comaniciu and Meer, 2002) to partition the observed RD profile into segments with presumably different copy numbers, merges segments with minimal difference in RD by a greedy algorithm, and then makes CNV calls via a t-test procedure.…”
Copy number variation (CNV) is a very important phenomenon in tumor genomes and plays a significant role in tumor genesis. Accurate detection of CNVs has become a routine and necessary procedure for a deep investigation of tumor cells and diagnosis of tumor patients. Next-generation sequencing (NGS) technique has provided a wealth of data for the detection of CNVs at base-pair resolution. However, such task is usually influenced by a number of factors, including GC-content bias, sequencing errors, and correlations among adjacent positions within CNVs. Although many existing methods have dealt with some of these artifacts by designing their own strategies, there is still a lack of comprehensive consideration of all the factors. In this paper, we propose a new method, MFCNV, for an accurate detection of CNVs from NGS data. Compared with existing methods, the characteristics of the proposed method include the following: (1) it makes a full consideration of the intrinsic correlations among adjacent positions in the genome to be analyzed, (2) it calculates read depth, GC-content bias, base quality, and correlation value for each genome bin and combines them as multiple features for the evaluation of genome bins, and (3) it addresses the joint effect among the factors via training a neural network algorithm for the prediction of CNVs. We test the performance of the MFCNV method by using simulation and real sequencing data and make comparisons with several peer methods. The results demonstrate that our method is superior to other methods in terms of sensitivity, precision, and F1-score and can detect many CNVs that other methods have not discovered. MFCNV is expected to be a complementary tool in the analysis of mutations in tumor genomes and can be extended to be applied to the analysis of single-cell sequencing data.
“…Considering the importance of genomic positions, we combine smoothed RD signals and their corresponding genomic positions to transform the smoothed RD signals in one-dimensional space into a two-dimensional profile. The details of this transformation are described in our previous study [28]. In this way, we can observe RD signals from both horizontal and vertical levels, which reflect copy number amplitude and positional space, respectively.…”
Section: A Bias Correction and Segmentation (mentioning)
Comprehensive identification and cataloging of copy number variation (CNVs) are essential to providing a complete view of human genetic variation and to finding diseased genes. Due to the large-scale sequencing and cost control whole-genome sequencing (WGS) data, low-coverage data is favorably disposed towards CNV identification. However, such low-coverage data is sensitive to noise and sequencing biases, which results in low resolution of CNV detection in past experimental designs for WGS datasets. In this paper, we present a control-free Dirichlet process Gaussian mixture model (dpGMM) based approach, to analyze the read depth (RD) of low-coverage WGS datasets for CNV discovery. First, noise and biases of the RD signals are corrected through the preprocessing step of dpGMM. Then we assume that RD signals across genomic regions follow a Gaussian mixture model (GMM) in which each Gaussian distribution is followed by a copy number state. Without requiring the number of Gaussian distributions, dpGMM builds a Dirichlet process (DP) GMM for RD signals and further uses a DP prior to infer the number of Gaussian models. After that, we apply dpGMM to simulation datasets with different coverages and individual datasets, and compare ours to three widely used RD-based pipelines, CNVnator, GROM-RD, and BIC-seq2. Simulation results demonstrate that our approach, dpGMM, has a high F1 score in both low- and high-coverage sequences. Also, the number of overlaps between CNVs detected in real data by ours and the standard benchmark is twice as much as that detected by other tools such as CNVnator and GROM-RD.
Index terms: Copy number variation, Dirichlet process, Gaussian mixture model, read depth, low coverage.