Wiktor Kuśmirek scite author profile

Nowak

2018

BMC Bioinformatics

Background Many organisms, in particular bacteria, contain repetitive DNA fragments called tandem repeats. These structures are restored by DNA assemblers by mapping paired-end tags to unitigs, estimating the distance between them and filling the gap with the specified DNA motif, which could be repeated many times. However, some of the tandem repeats are longer than the distance between the paired-end tags. Results We present a new algorithm for de novo DNA assembly, which uses the relative frequency of reads to properly restore tandem repeats. The main advantage of the presented algorithm is that long tandem repeats, which are much longer than the maximum read length and the insert size of paired-end tags, can be properly restored. Moreover, repetitive DNA regions covered only by single-read sequencing data could also be restored. Other existing de novo DNA assemblers fail in such cases. The presented application is composed of several steps, including: (i) building the de Bruijn graph, (ii) correcting the de Bruijn graph, (iii) normalizing edge weights, and (iv) generating the output set of DNA sequences. We tested our approach on real data sets of bacterial organisms. Conclusions The software library, console application and web application were developed. The web application was developed in a client-server architecture, where a web browser is used to communicate with the end user and the algorithms are implemented in C++ and Python. The presented approach enables proper reconstruction of tandem repeats, which are longer than the insert size of paired-end tags. The application is freely available to all users under GNU Library or Lesser General Public License version 3.0 (LGPLv3).
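
Steps (i) and (iii) of the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and the normalization rule (dividing each edge count by the median count) is an assumption made only to show how relative k-mer frequency can hint at the copy number of a tandem repeat.

```python
from collections import Counter
from statistics import median


def build_de_bruijn(reads, k):
    """Count de Bruijn edges: each k-mer links its (k-1)-prefix to its (k-1)-suffix."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


def normalize_weights(edges):
    """Assumed rule (not from the paper): divide counts by the median count and round."""
    baseline = median(edges.values())
    return {edge: round(count / baseline) for edge, count in edges.items()}


# Toy input: a single read containing the tandem repeat (CA)n.
reads = ["ATGCACACACACGTT"]
edges = build_de_bruijn(reads, k=3)
copies = normalize_weights(edges)
print(copies[("CA", "AC")], copies[("AC", "CA")])
```

In this toy input the edges inside the repeat receive higher normalized weights than the edges in the unique flanks, which is the kind of signal a frequency-based assembler can exploit.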

Hybrid de novo whole-genome assembly and annotation of the model tapeworm Hymenolepis diminuta

et al. 2019

Despite the use of Hymenolepis diminuta as a model organism in experimental parasitology, a full genome description has not yet been published. Here we present a hybrid de novo genome assembly based on complementary sequencing technologies and methods. The combination of Illumina paired-end, Illumina mate-pair and Oxford Nanopore Technology reads greatly improved the assembly of the H. diminuta genome. Our results indicate that the hybrid sequencing approach is the method of choice for obtaining high-quality data. The final genome assembly is 177 Mbp with contig N50 size of 75 kbp and a scaffold N50 size of 2.3 Mbp. We obtained one of the most complete cestode genome assemblies and annotated 15,169 potential protein-coding genes. The obtained data may help explain cestode gene function and better clarify the evolution of its gene families, and thus the adaptive features evolved during millennia of co-evolution with their hosts.
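
The N50 statistic quoted in the abstract above is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of the standard definition follows; the function name `n50` and the toy lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("N50 is undefined for an empty assembly")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```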

Comparison of kNN and k-means optimization methods of reference set selection for improved CNV callers performance

et al. 2019

Background There are over 25 tools dedicated for the detection of Copy Number Variants (CNVs) using Whole Exome Sequencing (WES) data based on read depth analysis. The tools reported consist of several steps, including: (i) calculation of read depth for each sequencing target, (ii) normalization, (iii) segmentation and (iv) actual CNV calling. The essential aspect of the entire process is the normalization stage, in which systematic errors and biases are removed and the reference sample set is used to increase the signal-to-noise ratio. Although some CNV calling tools use dedicated algorithms to obtain the optimal reference sample set, most of the advanced CNV callers do not include this feature. To our knowledge, this work is the first attempt to assess the impact of reference sample set selection on CNV detection performance. Methods We used WES data from the 1000 Genomes project to evaluate the impact of various methods of reference sample set selection on CNV calling performance of three chosen state-of-the-art tools: CODEX, CNVkit and exomeCopy. Two naive solutions (all samples as reference set and random selection) as well as two clustering methods (k-means and k nearest neighbours (kNN) with a variable number of clusters or group sizes) have been evaluated to discover the best performing sample selection method. Results and Conclusions The performed experiments have shown that the appropriate selection of the reference sample set may greatly improve the CNV detection rate. In particular, we found that smart reduction of reference sample size may significantly increase the algorithms’ precision while having negligible negative effect on sensitivity. We observed that a complete CNV calling process with the k-means algorithm as the selection method has significantly better time complexity than kNN-based solution. 
Electronic supplementary material The online version of this article (10.1186/s12859-019-2889-z) contains supplementary material, which is available to authorized users.
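
The kNN variant of reference set selection described in the abstract above can be sketched as follows. This is an illustration under stated assumptions rather than the published method: the Euclidean distance between per-target read depth profiles, the function name `knn_reference_set`, and the sample names are all hypothetical.

```python
import math


def knn_reference_set(profiles, target, k):
    """Pick the k samples whose read depth profiles are closest to the target sample.

    profiles maps a sample name to its list of read depths, one per sequencing target.
    Euclidean distance is an assumption; the abstract does not name the metric.
    """
    query = profiles[target]
    distances = [
        (math.dist(query, depths), name)
        for name, depths in profiles.items()
        if name != target
    ]
    distances.sort()
    return [name for _, name in distances[:k]]


profiles = {
    "s1": [10, 20, 30],
    "s2": [11, 19, 31],
    "s3": [50, 60, 70],
    "s4": [10, 21, 29],
}
print(knn_reference_set(profiles, "s1", k=2))
```

The selected samples would then serve as the reference set when normalizing the target sample, in place of the full cohort.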

Linking De Novo Assembly Results with Long DNA Reads Using the dnaasm-link Application

BioMed Research International

Franus

Nowak

2019

Currently, third-generation sequencing techniques, which make it possible to obtain much longer DNA reads compared to the next-generation sequencing technologies, are becoming more and more popular. There are many possibilities for combining data from next-generation and third-generation sequencing. Herein, we present a new application called dnaasm-link for linking contigs, the result of de novo assembly of second-generation sequencing data, with long DNA reads. Our tool includes an integrated module to fill gaps with a suitable fragment of an appropriate long DNA read, which improves the consistency of the resulting DNA sequences. This feature is very important, in particular for complex DNA regions. Our implementation is found to outperform other state-of-the-art tools in terms of speed and memory requirements, which may enable its usage for organisms with a large genome, something which is not possible in existing applications. The presented application has many advantages: (i) it significantly optimizes memory and reduces computation time; (ii) it fills gaps with an appropriate fragment of a specified long DNA read; (iii) it reduces the number of spanned and unspanned gaps in existing genome drafts. The application is freely available to all users under GNU Library or Lesser General Public License version 3.0 (LGPLv3). The demo application, Docker image, and source code can be downloaded from the project homepage.

SeQuiLa-cov: A fast and scalable library for depth of coverage calculations

Wiewiórka

Szmurło

et al. 2018

Preprint

Background Depth of coverage calculation is an important and computationally intensive preprocessing step in a variety of next generation sequencing pipelines, including the analyses of RNA-seq data, detection of copy number variants, or quality control procedures. Results Building upon big data technologies, we have developed SeQuiLa-cov, an extension to the recently released SeQuiLa platform, which provides efficient depth of coverage calculations, reaching more than 100x speedup over the state-of-the-art tools. Performance and scalability of our solution allow for exome and genome-wide calculations running locally or on a cluster while hiding the complexity of the distributed computing with Structured Query Language Application Programming Interface. Conclusions SeQuiLa-cov provides significant performance gain in depth of coverage calculations streamlining the widely used bioinformatic processing pipelines.
• SeQuiLa-cov allows for high-coverage (∼60x) genome-wide depth of coverage calculations in less than one minute.
• SeQuiLa-cov provides an ANSI SQL-compliant API for accessing and analyzing aligned sequencing read data.

CNVind: an open source cloud-based pipeline for rare CNVs detection in whole exome sequencing data based on the depth of coverage

Nowak

2022

BMC Bioinformatics

Background A typical Copy Number Variations (CNVs) detection process based on the depth of coverage in the Whole Exome Sequencing (WES) data consists of several steps: (I) calculating the depth of coverage in sequencing regions, (II) quality control, (III) normalizing the depth of coverage, (IV) calling CNVs. Previous tools performed one normalization process for each chromosome: all the coverage depths in the sequencing regions from a given chromosome were normalized in a single run. Methods Herein, we present the new CNVind tool for calling CNVs, where the normalization process is conducted separately for each of the sequencing regions. The total number of normalizations is equal to the number of sequencing regions in the investigated dataset. For example, when analyzing a dataset composed of n sequencing regions, CNVind performs n independent depth of coverage normalizations. Before each normalization, the application selects the k most correlated sequencing regions, using Pearson's correlation of the depth of coverage as the distance metric. Then, the resulting subgroup of $k+1$ sequencing regions is normalized and the results of all n independent normalizations are combined; finally, the segmentation and CNV calling process is performed on the resultant dataset. Results and conclusions We used WES data from the 1000 Genomes project to evaluate the impact of independent normalization on CNV calling performance and compared the results with state-of-the-art tools: CODEX and exomeCopy. The results showed that independent normalization significantly improves the specificity of rare CNV detection. For example, for the investigated dataset, we reduced the number of FP calls from over 15,000 to around 5000 while maintaining a constant number of TP calls equal to about 150 CNVs.
However, independent normalization of each sequencing region is a computationally expensive process; therefore, our pipeline is customized so that it can be easily run in a cloud computing environment, on a computer cluster, or on a single-CPU server. To our knowledge, the presented application is the first attempt to implement an innovative approach to independent normalization of the depth of WES data coverage.
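
The region selection step that precedes each normalization in the abstract above can be sketched as follows. This is a minimal illustration, not CNVind's code: the function names and toy depths are hypothetical, and ranking by signed (rather than absolute) Pearson correlation is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (both must have non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def most_correlated_regions(depth, region, k):
    """Return the k regions whose depth across samples correlates best with `region`.

    depth maps a region name to its list of depths, one value per sample.
    """
    scores = [
        (pearson(depth[region], values), name)
        for name, values in depth.items()
        if name != region
    ]
    scores.sort(reverse=True)
    return [name for _, name in scores[:k]]


depth = {
    "r1": [1, 2, 3, 4],
    "r2": [2, 4, 6, 8.5],
    "r3": [4, 3, 2, 1],
    "r4": [1, 2, 3, 5],
}
print(most_correlated_regions(depth, "r1", k=2))
```

Repeating this selection for every region yields the n subgroups of k + 1 regions that are normalized independently, which is why the overall cost grows with the number of sequencing regions.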

SeQuiLa-cov: A fast and scalable library for depth of coverage calculations

Wiewiórka

Szmurło

et al. 2019

Background Depth of coverage calculation is an important and computationally intensive preprocessing step in a variety of next-generation sequencing pipelines, including the analysis of RNA-sequencing data, detection of copy number variants, or quality control procedures. Results Building upon big data technologies, we have developed SeQuiLa-cov, an extension to the recently released SeQuiLa platform, which provides efficient depth of coverage calculations, reaching >100× speedup over the state-of-the-art tools. The performance and scalability of our solution allow for exome and genome-wide calculations running locally or on a cluster while hiding the complexity of the distributed computing with Structured Query Language Application Programming Interface. Conclusions SeQuiLa-cov provides significant performance gain in depth of coverage calculations streamlining the widely used bioinformatic processing pipelines.
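
For readers unfamiliar with the quantity itself, per-base depth of coverage can be computed with a difference array over read start and end positions. The sketch below shows this standard single-machine technique only; it is not SeQuiLa-cov's distributed implementation or its SQL API, and the function name and interval convention (0-based, half-open) are assumptions.

```python
def depth_of_coverage(reads, length):
    """Per-base depth for a contig of the given length.

    reads is an iterable of (start, end) alignments, 0-based and half-open.
    """
    diff = [0] * (length + 1)
    for start, end in reads:
        diff[start] += 1  # coverage rises where a read begins
        diff[end] -= 1    # and falls just past its last base
    depth = []
    running = 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


print(depth_of_coverage([(0, 4), (2, 6), (3, 5)], length=8))
```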

Clustering-based optimization method of reference set selection for improved CNV callers performance

Szmurło

Wiewiórka

et al. 2018

Preprint

Background: There are over 25 tools dedicated for the detection of Copy Number Variants (CNVs) using Whole Exome Sequencing (WES) data based on read depth analysis. The tools reported consist of several steps, including: (i) calculation of read depth for each sequencing target, (ii) normalization, (iii) segmentation and (iv) actual CNV calling. The essential aspect of the entire process is the normalization stage, in which systematic errors and biases are removed and the reference sample set is used to increase the signal-to-noise ratio. Although some CNV calling tools use dedicated algorithms to obtain the optimal reference sample set, most of the advanced CNV callers do not include this feature. To our knowledge, this work is the first attempt to assess the impact of reference sample set selection on CNV detection performance. Methods: We used WES data from the 1000 Genomes project to evaluate the impact of various methods of reference sample set selection on CNV calling performance of three chosen state-of-the-art tools: CODEX, CNVkit and exomeCopy. Two naive solutions (all samples as reference set and random selection) as well as two clustering methods (k-means and k nearest neighbours with a variable number of clusters or group sizes) have been evaluated to discover the best performing sample selection method.