Improving de novo Assembly Based on Read Classification

Liao, Xingyu; Li, Min; Liu, Junwei; Zou, You; Wu, Fang‐Xiang; Pan, Yi; Luo, Feng

doi:10.1109/tcbb.2018.2861380

Cited by 17 publications

(9 citation statements)

References 29 publications


“…9). RepAHR estimates the average read coverage using a method similar to that in literature [28]. The calculation principle is shown as follows: Where p is the horizontal coordinate of the main peak in the k-mer frequency distribution histogram, length is the average length of the input NGS reads, k is the k-mer length used in estimation which is settled to 15 by default, and Cov is the average read coverage estimated.…”

Section: Methods (mentioning)

confidence: 99%
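The equation itself did not survive in the quoted statement. As a rough sketch only, the variables it defines (p, length, k, Cov) match a widely used relation between the k-mer depth peak and base coverage, Cov = p × length / (length − k + 1). Whether RepAHR uses exactly this form is an assumption, and the function names and the `min_freq` cutoff below are hypothetical.

```python
def main_peak(histogram, min_freq=2):
    """Return the k-mer frequency with the most distinct k-mers.

    `histogram` maps a frequency to the number of distinct k-mers seen
    that many times. Frequencies below `min_freq` are skipped because
    they are usually dominated by sequencing errors (an assumption).
    """
    candidates = {f: n for f, n in histogram.items() if f >= min_freq}
    return max(candidates, key=candidates.get)


def estimate_coverage(p, length, k=15):
    """Convert the k-mer depth peak `p` into average read coverage.

    Assumed relation: a read of `length` bases yields length - k + 1
    k-mers, so base coverage is p * length / (length - k + 1).
    """
    return p * length / (length - k + 1)


if __name__ == "__main__":
    hist = {1: 1000, 2: 50, 30: 400, 31: 390}
    p = main_peak(hist)
    print(p, estimate_coverage(p, length=100, k=15))
```

With k = 15 and reads of about 100 bases, the correction factor length / (length − k + 1) is roughly 1.16, so the estimated coverage is somewhat higher than the raw k-mer peak.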


RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads

et al. 2020

Self Cite


Background: Repetitive sequences account for a large proportion of eukaryotic genomes. Identification of repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines or tools assemble high-frequency k-mers to obtain repeats. However, assemblers require a certain degree of sequence coverage to produce the desired assemblies. In addition, assemblers cut the reads into shorter k-mers for assembly, which may destroy the structure of the repetitive regions. For these reasons, it is difficult to obtain complete and accurate repetitive regions in the genome by using existing tools. Results: In this study, we present a new method called RepAHR for de novo repeat identification by assembly of the high-frequency reads. Firstly, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Secondly, RepAHR filters the high-frequency reads from the whole set of NGS reads according to certain rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled to generate repeats by using SPAdes, which is considered an outstanding genome assembler for NGS sequences. Conclusions: We test RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo for detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase and some other metrics.
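The first two steps described in this abstract (find high-frequency k-mers, then keep the reads they dominate) can be sketched as follows. This is a minimal illustration, not RepAHR's implementation: the abstract only says reads are filtered "according to certain rules", so the `min_count` and `min_fraction` thresholds and all function names here are hypothetical. The final assembly step with SPAdes is not shown.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def high_frequency_reads(reads, k, min_count, min_fraction):
    """Keep reads in which enough k-mers are high-frequency.

    A k-mer is treated as high-frequency if it occurs at least
    `min_count` times; a read is kept if at least `min_fraction` of its
    k-mers are high-frequency. Both rules are assumptions made for
    illustration.
    """
    counts = count_kmers(reads, k)
    frequent = {kmer for kmer, c in counts.items() if c >= min_count}
    selected = []
    for read in reads:
        n = len(read) - k + 1
        if n <= 0:
            continue  # read is shorter than k, so it has no k-mers
        hits = sum(read[i:i + k] in frequent for i in range(n))
        if hits / n >= min_fraction:
            selected.append(read)
    return selected


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
    print(high_frequency_reads(reads, k=4, min_count=2, min_fraction=0.5))
```

Filtering whole reads rather than individual k-mers is what lets the downstream assembler see the intact read sequences, which is the motivation the abstract gives for the approach.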


“…9). RepAHR estimates the average read coverage using a method similar to that in literature [28]. The calculation principle is shown as follows:…”

Section: Estimating the Average Read Coverage (mentioning)

confidence: 99%


“…Although considerable third generation sequencing data has been produced, due to the higher cost per base and higher sequencing errors, NGS sequencing data still plays an important role in tackling an increasing list of biological problems. The de novo genome assembly is a fundamental process for computational biology (Schatz et al, 2010), which drives the generation of many assemblers to complete the construction of genome sequences, such as Velvet (Zerbino and Birney, 2008), ABySS (Simpson et al, 2009), ALLPATHS-LG (Gnerre and Jaffe, 2011), SOAPdenovo (Li et al, 2010), EPGA2 (Luo et al, 2015), Miniasm (Li, 2015), BOSS , SCOP (Li et al, 2018a), ARC (Liao et al, 2018), iLSLS (Li et al, 2018b), MEC , EPGA-SC (Liao et al, 2019a), PE-Trimmer (Liao et al, 2019b), and so on.…”

Section: Introduction (mentioning)

confidence: 99%

MAC: Merging Assemblies by Using Adjacency Algebraic Model and Classification

Tang et al. 2020

Front. Genet.

Self Cite


With the generation of a large amount of sequencing data, different assemblers have emerged to perform de novo genome assembly. Because no single strategy fits the various biases of datasets, none of these tools outperforms the others on all species. Assembly reconciliation merges multiple assemblies to generate a high-quality consensus assembly. Several assembly reconciliation tools have been proposed. However, the existing reconciliation tools cannot produce a merged assembly that simultaneously has better contiguity and contains fewer errors, and the results of these tools usually depend on the ranking of input assemblies. In this study, we propose a novel assembly reconciliation tool MAC, which merges assemblies by using the adjacency algebraic model and classification. To address uneven sequencing depth and sequencing errors, MAC identifies consensus blocks between contig sets to construct an adjacency graph. To address repetitive regions, MAC employs classification to optimize the adjacency algebraic model. In addition, MAC designs an overall scoring function to handle the unknown ranking of input assembly sets. The experimental results from four species of GAGE-B demonstrate that MAC outperforms other assembly reconciliation tools.


“…Read sequencing from next-generation sequencing (NGS) technology (Miller et al, 2010), is usually short, i.e., only a few hundred base pairs in length. Short reads commonly cannot be used to solve problems caused by long repetitive regions (Liao et al, 2020). In addition, NGS polymers commonly lead to some GC bias, which will affect the correctness of the genome assembly (Farrer et al, 2009; Luo et al, 2012).…”

Section: Introduction (mentioning)

confidence: 99%

LROD: An Overlap Detection Algorithm for Long Reads Based on k-mer Distribution

et al. 2020

Self Cite


Third-generation sequencing technologies can produce large numbers of long reads, which have been widely used in many fields. When using long reads for genome assembly, overlap detection between any pair of long reads is an important step. However, the sequencing error rate of third-generation sequencing technologies is very high, and obtaining accurate overlap detection results is still a challenging task. In this study, we present a long-read overlap detection (LROD) algorithm that can improve the accuracy of overlap detection results. To detect overlaps between two long reads, LROD first retains only the solid common k-mers between them. These k-mers can simplify the process of overlap detection. Second, LROD finds a chain (i.e., candidate overlap) that includes the consistent common k-mers. In this step, LROD proposes a two-stage strategy to evaluate whether two common k-mers are consistent. Finally, LROD uses a novel strategy to determine whether the candidate overlaps are true and to revise them. To verify the performance of LROD, three simulated and three real long-read datasets are used in the experiments. Compared with two other popular methods (MHAP and Minimap2), LROD can achieve good performance in terms of the F1-score, precision and recall. LROD is available from https://github.com/luojunwei/LROD.
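To make the idea of "common k-mers" and "consistent" pairs concrete, here is a small sketch. It is not LROD's algorithm: the abstract does not describe its two-stage consistency test, so the `max_drift` rule, the optional `solid` set, and the choice to keep only k-mers that occur once in each read are all assumptions made for illustration.

```python
def kmer_positions(read, k):
    """Map each k-mer in `read` to the list of its start positions."""
    positions = {}
    for i in range(len(read) - k + 1):
        positions.setdefault(read[i:i + k], []).append(i)
    return positions


def common_kmers(read_a, read_b, k, solid=None):
    """Return sorted (pos_a, pos_b, kmer) anchors shared by two reads.

    If `solid` is given, only k-mers in that set are kept. K-mers that
    repeat within either read are dropped to avoid ambiguous anchors
    (a simplification, not taken from the source).
    """
    pos_a = kmer_positions(read_a, k)
    pos_b = kmer_positions(read_b, k)
    anchors = []
    for kmer, starts_a in pos_a.items():
        starts_b = pos_b.get(kmer)
        if starts_b is None:
            continue
        if solid is not None and kmer not in solid:
            continue
        if len(starts_a) == 1 and len(starts_b) == 1:
            anchors.append((starts_a[0], starts_b[0], kmer))
    return sorted(anchors)


def consistent(first, second, max_drift):
    """Hypothetical check: both anchors advance in the same direction and
    their spacing differs by at most `max_drift` between the two reads."""
    gap_a = second[0] - first[0]
    gap_b = second[1] - first[1]
    return gap_a > 0 and gap_b > 0 and abs(gap_a - gap_b) <= max_drift


if __name__ == "__main__":
    anchors = common_kmers("AAACGTTGCA", "CGTTGCATTT", k=4)
    print(anchors)
    print(consistent(anchors[0], anchors[1], max_drift=0))
```

A non-zero `max_drift` would be needed for real long reads, where insertions and deletions shift anchor spacing between the two reads.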

