Xiaoqiu Huang scite author profile

We describe the third generation of the CAP sequence assembly program. The CAP3 program includes a number of improvements and new features. The program has a capability to clip 5′ and 3′ low-quality regions of reads. It uses base quality values in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. The program also uses forward-reverse constraints to correct assembly errors and link contigs. Results of CAP3 on four BAC data sets are presented. The performance of CAP3 was compared with that of PHRAP on a number of BAC data sets. PHRAP often produces longer contigs than CAP3 whereas CAP3 often produces fewer errors in consensus sequences than PHRAP. It is easier to construct scaffolds with CAP3 than with PHRAP on low-pass data with forward-reverse constraints.

The shotgun sequencing strategy has been used widely in genome sequencing projects. A major phase in this strategy is to assemble short reads into long sequences. A number of DNA sequence assembly programs have been developed (Staden 1980; Peltola et al. 1984; Huang 1992; Smith et al. 1993; Gleizes and Henaut 1994; Lawrence et al. 1994; Kececioglu and Myers 1995; Sutton et al. 1995; Green 1996). The FAKII program provides a library of routines for each phase of the assembly process (Larson et al. 1996). The GAP4 program has a number of useful interactive features (Bonfield et al. 1995). The PHRAP program clips 5′ and 3′ low-quality regions of reads and uses base quality values in evaluation of overlaps and generation of contig sequences (Green 1996). TIGR Assembler has been used in a number of megabase microbial genome projects (Sutton et al. 1995). Continued development and improvement of sequence assembly programs are required to meet the challenges of the human, mouse, and maize genome projects.

We have developed the third generation of the CAP sequence assembly program (Huang 1992). The CAP3 program includes a number of improvements and new features. A capability to clip 5′ and 3′ low-quality regions of reads is included in the CAP3 program. Base quality values produced by PHRED are used in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. Efficient algorithms are employed to identify and compute overlaps between reads. Forward-reverse constraints are used to correct assembly errors and link contigs. Results of CAP3 on four BAC data sets are presented. The performance of CAP3 was compared with that of PHRAP on a number of BAC data sets. PHRAP often produces longer contigs than CAP3 whereas CAP3 often produces fewer errors in consensus sequences than PHRAP. It is easier to construct scaffolds with CAP3 than with PHRAP on low-pass data with forward-reverse constraints.

An unusual feature of CAP3 is the use of forward-reverse constraints in the construction of contigs. A forward-reverse constraint is often produced by sequencing of both ends of a subclone. A forward-reverse constraint specifies that the two...


TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets

Pertea, Huang, Feng et al. 2003

1,690

1,198


Initial sequence of the chimpanzee genome and comparison with the human genome

Mikkelsen, Hillier, Eichler et al. 2005

Nature

2,106

743


Here we present a draft genome sequence of the common chimpanzee (Pan troglodytes). Through comparison with the human genome, we have generated a largely complete catalogue of the genetic differences that have accumulated since the human and chimpanzee species diverged from our common ancestor, constituting approximately thirty-five million single-nucleotide changes, five million insertion/deletion events, and various chromosomal rearrangements. We use this catalogue to explore the magnitude and regional variation of mutational forces shaping these two genomes, and the strength of positive and negative selection acting on their genes. In particular, we find that the patterns of evolution in human and chimpanzee protein-coding genes are highly correlated and dominated by the fixation of neutral and slightly deleterious alleles. We also use the chimpanzee genome as an outgroup to investigate human population genetics and identify signatures of selective sweeps in recent human evolution.


Evolutionary and Biomedical Insights from the Rhesus Macaque Genome

Gibbs, Rogers, Katze et al. 2007

Science

1,213

713


The rhesus macaque (Macaca mulatta) is an abundant primate species that diverged from the ancestors of Homo sapiens about 25 million years ago. Because they are genetically and physiologically similar to humans, rhesus monkeys are the most widely used nonhuman primate in basic and applied biomedical research. We determined the genome sequence of an Indian-origin Macaca mulatta female and compared the data with chimpanzees and humans to reveal the structure of ancestral primate genomes and to identify evidence for positive selection and lineage-specific expansions and contractions of gene families. A comparison of sequences from individual animals was used to investigate their underlying genetic diversity. The complete description of the macaque genome blueprint enhances the utility of this animal model for biomedical research and improves our understanding of the basic biology of the species.


A time-efficient, linear-space local similarity algorithm

Huang, Miller 1991

Advances in Applied Mathematics

910

551


Assemblathon 1: A competitive assessment of de novo short read assembly methods

Earl, Bradnam, John et al. 2011

Genome Res.

464

418


Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype-aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark: (1) it is possible to assemble the genome to a high level of coverage and accuracy, and (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies, is now public and freely available from http://www.assemblathon.org/.


Over 20% of human transcripts might form sense-antisense pairs

Chen, Sun, Kent et al. 2004

Nucleic Acids Research

299

337


The major challenge to identifying natural sense-antisense (SA) transcripts from public databases is how to determine the correct orientation for an expressed sequence, especially an expressed sequence tag sequence. In this study, we established a set of very stringent criteria to identify the correct orientation of each human transcript. We used these orientation-reliable transcripts to create 26 741 transcription clusters in the human genome. Our analysis shows that 22% (5880) of the human transcription clusters form SA pairs, higher than any previous estimates. Our orientation-specific RT-PCR results along with the comparison of experimental data from previous studies confirm that our SA data set is reliable. This study not only demonstrates that our criteria for the prediction of SA transcripts are efficient, but also provides additional convincing data to support the view that antisense transcription is quite pervasive in the human genome. In-depth analyses show that SA transcripts have some significant differences compared with other types of transcripts, with regard to chromosomal distribution and Gene Ontology-annotated categories of physiological roles, functions and spatial localizations of gene products.


Genome analysis of the platypus reveals unique signatures of evolution

Warren, Hillier, Graves et al. 2008

Nature

603

318


We present a draft genome sequence of the platypus, Ornithorhynchus anatinus. This monotreme exhibits a fascinating combination of reptilian and mammalian characters. For example, platypuses have a coat of fur adapted to an aquatic lifestyle; platypus females lactate, yet lay eggs; and males are equipped with venom similar to that of reptiles. Analysis of the first monotreme genome aligned these features with genetic innovations. We find that reptile and platypus venom proteins have been co-opted independently from the same gene families; milk protein genes are conserved despite platypuses laying eggs; and immune gene family expansions are directly related to platypus biology. Expansions of protein, non-protein-coding RNA and microRNA families, as well as repeat elements, are identified. Sequencing of this genome now provides a valuable resource for deep mammalian comparative analyses, as well as for monotreme biology and conservation.



Xiaoqiu Huang

CAP3: A DNA Sequence Assembly Program
