A local alignment tool for very long DNA sequences

Chao, Kun-Mao; Zhang, Jinghui; Ostell, James; Miller, Webb

doi:10.1093/bioinformatics/11.2.147

Cited by 35 publications

(30 citation statements)

References 19 publications

(19 reference statements)

“…Human and mouse mRNA and protein sequences were aligned using the sim2 program (Chao et al 1995) that, by sequence accession number, extracts sequence data directly from GenBank (Benson et al 1996) using the Entrez application programming interface (ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools). This assures that the most recent data always was used.…”

Section: Methods (mentioning)

confidence: 99%

Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences.

Makałowski; J.; Boguski (1996). Genome Res.

A large set of mRNA and encoded protein sequences, from orthologous murine and human genes, was compiled to analyze statistical, biological, and evolutionary properties of coding and noncoding transcribed sequences. Protein sequence conservation varied between 36% and 100% identity, with an average value of 85%. The average degree of nucleotide sequence identity for the corresponding coding sequences was also ~85%, whereas 5' and 3' untranslated regions (UTRs) were less conserved, with aligned identities of 67% and 69%, respectively. For some mouse and human genes, nucleotide sequences are more highly conserved than the encoded protein sequences. A subset of 32 sequences, consisting of only mouse/human protein pairs for which the human sequence represents a positionally cloned disease gene, had properties very similar to the larger data set, suggesting that our data are representative of the genome as a whole. With respect to sequence conservation, two interesting outliers are the breast cancer (BRCA1) gene product and the testis-determining factor (SRY), both of which display among the lowest degrees of sequence identity. The occurrence of both introns and repetitive elements (e.g., Alu, B1) in 5' and 3' UTRs was also studied. These results provide one benchmark for the "comparative genomics" of mice and humans, with practical implications for the cross-referencing of transcript maps. Also, they should prove useful in estimating the additional sampling diversity provided by mouse EST sequencing projects designed to complement the existing human cDNA collection.

“…Both the BLAST search and Entrez access require connections to the servers at NCBI; all of the other processes are computed on the client machine. Prior to the BLAST search, SIM2 (Chao et al 1994) computes repeat regions in the query sequence, and the results are automatically annotated as repeat features in the query sequence. For a DNA query sequence, the low complexity regions are identified by the "dust" program (J. Kuzio, R. Tatusov, and D.J.…”

Section: Figure (mentioning)

confidence: 99%

“…Organism-specific results can be obtained at any level of the NCBI taxonomy by filtering the HSP alignments inclusively or exclusively with Entrez Taxonomy Server. A suite of SIM algorithms, which include SIM (Huang et al 1990), SIM2 (Chao et al 1994), and SIM3 (Chao et al 1997) may be selected to compute more refined gapped alignments. The details of repeat filtering, processing of large sequences, restricting the search by organism, and gapped alignments are described below.…”

Section: Figure (mentioning)

confidence: 99%

PowerBLAST: A New Network BLAST Application for Interactive or Automated Sequence Analysis and Annotation

Zhang; Madden (1997). Genome Res. Self-citation.

As the rate of DNA sequencing increases, analysis by sequence similarity search will need to become much more efficient in terms of sensitivity, specificity, automation potential, and consistency in annotation. PowerBLAST was developed, in part, to address these problems. PowerBLAST includes a number of options for masking repetitive elements and low complexity subsequences. It also has the capacity to restrict the search to any level of NCBI’s taxonomy index, thus supporting “comparative genomics” applications. Postprocessing of the BLAST output using the SIM series of algorithms produces optimal, gapped alignments, and multiple alignments when a region of the query sequence matches multiple database sequences. PowerBLAST is capable of processing sequences of any length because it divides long query sequences into overlapping fragments and then merges the results after searching. The results may be viewed graphically, as a textual representation, or as an HTML page with links to GenBank and Entrez. For matching database sequences, annotated features are superimposed on the aligned query sequence in the output, thus greatly increasing the ease of interpretation. Such features may be used for automated annotation of new sequence because PowerBLAST output in ASN.1 form may be “dragged and dropped” into NCBI’s Sequin program for sequence annotation and submission. PowerBLAST is capable of analyzing and annotating a 100-kb query in 60 min on NCBI’s BLAST server. [THC BLAST is available at http://www.ncbi.nlm.nih.gov/cgi-bin/THCBlast/nph-thcblast]
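
The abstract above describes dividing a long query into overlapping fragments and merging the results after searching. The sketch below illustrates that general idea only; it is not PowerBLAST's code. The function names, the fragment and overlap sizes, and the representation of hits as half-open `(start, end)` intervals in fragment-local coordinates are all assumptions made for illustration.

```python
def split_with_overlap(seq_len, frag_len, overlap):
    """Return half-open (start, end) windows that cover [0, seq_len).

    Consecutive windows share `overlap` positions (hypothetical parameter),
    so a hit that straddles a window boundary is still seen whole in one
    window as long as it is shorter than the overlap.
    """
    if not 0 <= overlap < frag_len:
        raise ValueError("overlap must satisfy 0 <= overlap < frag_len")
    step = frag_len - overlap
    windows = []
    start = 0
    while True:
        end = min(start + frag_len, seq_len)
        windows.append((start, end))
        if end >= seq_len:
            break
        start += step
    return windows


def merge_hits(windows, hits_per_window):
    """Shift fragment-local hit intervals to query coordinates and merge them.

    `hits_per_window[k]` is a list of (start, end) intervals relative to
    the start of `windows[k]` (an assumed data layout, not a BLAST format).
    """
    shifted = sorted(
        (w_start + h_start, w_start + h_end)
        for (w_start, _), hits in zip(windows, hits_per_window)
        for h_start, h_end in hits
    )
    merged = []
    for start, end in shifted:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, `split_with_overlap(10, 4, 1)` yields three windows of length 4 that each share one position with their neighbor, and hits reported separately in two adjacent windows collapse into a single interval once shifted back to query coordinates.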

“…However, since their complexities are quadratic with respect to the length of the two sequences this approach leads to a high computing time. One frequently used approach to speed up this time consuming operation is to introduce heuristics to the alignment algorithm [14,15]. The main drawback of this solution is that the more time efficient the heuristics, the worse is the quality of the result [16].…”

Section: Introduction (mentioning)

confidence: 99%

Parallelizing and Analyzing the Behavior of Sequence Alignment Algorithm on a Cluster of Workstations for Large Datasets

R.; Shrimankar (2013). IJCA.

An MPI-based parallelization technique for improving the scalability of the global sequence alignment algorithm on clusters of workstations is presented. We propose a parallel implementation of the Wavefront algorithm based on a chunk size transformation to handle large datasets with the message passing model. Molecular biologists frequently align DNA sequences of entire genomes to detect important matched and mismatched regions. Even though efficient dynamic programming algorithms exist for this problem, the required computing time is still very high due to the size of these sequences. Because the number of sequenced organisms is increasing rapidly, fast and accurate solutions are of highest importance to research in this area. We show that an appropriate choice of the number of processes and chunk size has great impact on the overall system performance on a cluster system. We conducted experiments on real-life DNA samples of house mouse mitochondrion and rabbit mitochondrion obtained from the public database GenBank [GenBank, http://www.ncbi.nih.gov] to measure the algorithm behavior appropriately. The results of these experiments demonstrate that the developed parallel Wavefront algorithm exhibits high speedup and scales linearly with the increasing number of processes. Also, the communication among processes and memory requirements are kept at a minimum to achieve high efficiency. The experiments were performed on a cluster consisting of two workstations of 12 cores each with a multithreading environment.
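
The wavefront idea mentioned in this abstract rests on one observation: in the quadratic dynamic programming table for pairwise alignment, each cell depends only on its upper, left, and upper-left neighbors, so all cells on the same anti-diagonal can be computed independently. The sketch below fills a global alignment score table in anti-diagonal order to make that dependency structure visible. It is a serial illustration under assumed scoring values (match +1, mismatch -1, linear gap -1), not the MPI implementation or the chunking scheme from the cited paper.

```python
def nw_score_wavefront(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score, filling the DP table by anti-diagonals.

    The scoring parameters are illustrative defaults, not values taken
    from the cited work.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: alignments against an empty prefix.
    for i in range(1, n + 1):
        H[i][0] = i * gap
    for j in range(1, m + 1):
        H[0][j] = j * gap
    # Anti-diagonal d contains the interior cells with i + j == d.
    for d in range(2, n + m + 1):
        # Every cell on this anti-diagonal reads only from diagonals
        # d - 1 and d - 2, so this inner loop could run in parallel.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
    return H[n][m]
```

A parallel version would distribute the cells (or blocks of cells) of each anti-diagonal across workers and synchronize between diagonals; the block size plays the role that the abstract calls the chunk size.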
