You Zou scite author profile

Current challenges and solutions of de novo assembly

Background: Next-generation sequencing (NGS) technologies have fostered an unprecedented proliferation of high-throughput sequencing projects and a concomitant development of novel algorithms for the assembly of short reads. Many new ideas and solutions have been proposed in both experimental and computational settings, yet numerous technical and computational challenges in de novo assembly remain. Results: In this review, we first briefly introduce some of the major challenges faced by NGS sequence assembly. We then analyze the characteristics of various sequencing platforms and their impact on assembly results. After that, we classify de novo assemblers according to their frameworks (overlap graph-based, de Bruijn graph-based and string graph-based), and describe the characteristics of each assembly tool and the scenarios it is suited to. Next, we describe in detail the solutions to the main challenges of de novo assembly of next-generation sequencing data, single-cell sequencing data and single-molecule sequencing (SMS) data. Finally, we discuss the application of SMS long reads in solving problems encountered in NGS assembly. Conclusions: This review not only gives an overview of the latest methods and developments in assembly algorithms, but also provides guidelines for choosing the optimal assembly algorithm for a given type of input sequencing data.

Analysis of Long Noncoding RNAs for Acute Rejection and Graft Outcome in Kidney Transplant Biopsies

Zou, Zhang, Zhou, et al. 2019. Biomark. Med.

Aim: To analyse long noncoding RNAs (lncRNAs) in kidney transplant biopsies. Methods: Using a data mining approach, we constructed expression profiles in kidney transplant cohorts (n = 1105) from Gene Expression Omnibus. Integrative analyses of the lncRNAs with acute rejection (AR), T-cell-mediated acute rejection (TCMR) and graft loss were performed. Results: Six lncRNAs were identified as associated with AR in the training and validation datasets, and a risk score generated from three lncRNAs was predictive of graft loss (AUC = 0.73). MIR155HG is associated with AR, TCMR and graft loss. It might also be involved in several graft rejection and immune-associated pathways. Conclusion: The role of lncRNAs in AR and graft outcome in kidney transplant biopsies needs further investigation.

Improving de novo Assembly Based on Read Classification

Liao, Liu, et al. 2020. IEEE/ACM Trans. Comput. Biol. and Bioinf.

Due to sequencing bias, sequencing errors and repeats, genome assemblies usually contain misarrangements and gaps. When tackling these problems, current assemblers commonly treat the read libraries as a whole and apply the same strategy to all reads. In this paper, we present a new pipeline for genome assembly based on read classification (ARC). ARC classifies reads into three categories according to the frequencies of the k-mers they contain: (1) low depth reads, which contain a certain number of low-frequency k-mers and are often caused by sequencing errors or bias; (2) high depth reads, which contain a certain number of high-frequency k-mers and usually come from repetitive regions; (3) normal depth reads, which are the remaining reads. After read classification, an existing assembler is used to assemble the different read categories separately, which helps resolve problems in the genome assembly. ARC adopts loose assembly parameters for low depth reads, and strict assembly parameters for normal depth and high depth reads. We test ARC using five datasets. The experimental results show that assemblers combined with ARC can generate better assemblies in terms of NA50, NGA50 and genome fraction.
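
The three-way split described in this abstract can be illustrated with a minimal Python sketch. It is not the ARC implementation: the function names, the thresholds `low` and `high`, and the `min_fraction` rule for deciding when a read "contains a certain number" of low- or high-frequency k-mers are all assumptions made for illustration, and reverse complements are ignored.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def classify_reads(reads, k=5, low=1, high=4, min_fraction=0.5):
    """Split reads into low, high and normal depth by k-mer frequency.

    low, high and min_fraction are hypothetical parameters, not ARC's.
    """
    # Count how often each k-mer occurs across the whole read set.
    counts = Counter(km for read in reads for km in kmers(read, k))
    classes = {"low": [], "high": [], "normal": []}
    for read in reads:
        kms = list(kmers(read, k))
        if not kms:
            # Reads shorter than k carry no k-mer evidence.
            classes["normal"].append(read)
            continue
        n_low = sum(counts[km] <= low for km in kms)
        n_high = sum(counts[km] >= high for km in kms)
        if n_low / len(kms) >= min_fraction:
            classes["low"].append(read)
        elif n_high / len(kms) >= min_fraction:
            classes["high"].append(read)
        else:
            classes["normal"].append(read)
    return classes


if __name__ == "__main__":
    toy_reads = ["ACGTAC"] * 2 + ["TTTTTT"] * 3 + ["GGCCGA"]
    for name, group in classify_reads(toy_reads, k=3).items():
        print(name, group)
```

Each category could then be passed to an existing assembler with its own parameter set, loose for the low depth reads and strict for the other two, as the abstract describes.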

MultiNanopolish: refined grouping method for reducing redundant calculations in Nanopolish

Huang, Zou, et al. 2021

Motivation: Compared with second-generation sequencing technologies, third-generation sequencing technologies yield longer reads (average ∼10 kbps, maximum 900 kbps) but have a higher error rate (∼15%). Nanopolish is a variant and methylation detection tool based on a Hidden Markov Model (HMM), which uses Oxford Nanopore sequencing data for signal-level analysis. Nanopolish can greatly improve the accuracy of an assembly, but it is limited by a long running time, since most of its execution is serial and computationally expensive. Results: In this paper, we present an effective polishing tool, Multithreading Nanopolish (MultiNanopolish), which decomposes the whole iterative calculation in Nanopolish into small independent calculation tasks, making it possible to run this process in parallel. Experimental results show that, compared to the original Nanopolish and running with 40 threads, MultiNanopolish reduces running time by 50% with a read-uncorrected assembler (Miniasm) and by 20% with read-corrected assemblers (Canu and Flye). Availability: MultiNanopolish is available at GitHub: https://github.com/BioinformaticsCSU/MultiNanopolish Supplementary information: Supplementary data are available at Bioinformatics online.
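
The general idea of breaking a serial computation into independent tasks and running them in parallel can be sketched in Python as follows. This is only an illustration of the pattern, not MultiNanopolish's grouping method: the fixed-size windows, the function names, and the `polish_window` placeholder (which merely uppercases its segment) are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_windows(length, window):
    """Return half-open (start, end) intervals covering [0, length)."""
    return [(s, min(s + window, length)) for s in range(0, length, window)]


def polish_window(draft, start, end):
    """Hypothetical stand-in for polishing one segment of a draft assembly.

    A real polisher would realign signal data here; this placeholder
    only uppercases the segment so the data flow is visible.
    """
    return draft[start:end].upper()


def polish_parallel(draft, window=4, threads=4):
    """Polish independent windows concurrently and rejoin them in order."""
    windows = split_into_windows(len(draft), window)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields results in submission order, so the pieces
        # can be concatenated directly.
        pieces = pool.map(lambda w: polish_window(draft, *w), windows)
        return "".join(pieces)


if __name__ == "__main__":
    print(polish_parallel("acgtacgtac", window=4, threads=3))
```

The sketch works only because each window is processed without reference to its neighbours. For CPU-bound work in pure Python, a process pool would likely be a better fit than threads.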

msRepDB: a comprehensive repetitive sequence database of over 80 000 species

Liao, Salhi, et al. 2021

Repeats are prevalent in the genomes of all bacteria, plants and animals, and they cover nearly half of the human genome. They play indispensable roles in evolution, inheritance, variation and genomic instability, and serve as substrates for chromosomal rearrangements that include disease-causing deletions, inversions and translocations. Comprehensive identification, classification and annotation of repeats in genomes can provide accurate and targeted solutions for understanding and diagnosing complex diseases, optimizing plant properties and developing new drugs. RepBase and Dfam are the two most frequently used repeat databases, but they are not sufficiently complete. Because a comprehensive multi-species repeat database is lacking, current research in this field is far from satisfactory. LongRepMarker is a new framework recently developed by our group for comprehensive identification of genomic repeats. Here we propose msRepDB, based on LongRepMarker, which is currently the most comprehensive multi-species repeat database, covering >80 000 species. Comprehensive evaluations show that msRepDB contains more species, and more complete repeats and families, than the RepBase and Dfam databases (https://msrepdb.cbrc.kaust.edu.sa/pages/msRepDB/index.html).

Current challenges and solutions of de novo assembly
