Jieping Zhao scite author profile

Solving the protein sequence metric problem

Biological sequences are composed of long strings of alphabetic letters rather than arrays of numerical values. Lack of a natural underlying metric for comparing such alphabetic data significantly inhibits sophisticated statistical analyses of sequences, modeling structural and functional aspects of proteins, and related problems. Herein, we use multivariate statistical analyses on almost 500 amino acid attributes to produce a small set of highly interpretable numeric patterns of amino acid variability. These high-dimensional attribute data are summarized by five multidimensional patterns of attribute covariation that reflect polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. Numerical scores for each amino acid then transform amino acid sequences for statistical analyses. Relationships between transformed data and amino acid substitution matrices show significant associations for polarity and codon diversity scores. Transformed alphabetic data are used in analysis of variance and discriminant analysis to study DNA binding in the basic helix-loop-helix proteins. The transformed scores offer a general solution for a wide variety of sequence analysis problems.

Keywords: basic helix-loop-helix | molecular evolution | multivariate statistics | amino acid attributes | factor analysis
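The transformation described in this abstract, replacing each residue letter with a short vector of numeric scores, can be sketched as follows. This is a minimal illustration and not the authors' code: `encode_sequence` and `TOY_SCORES` are hypothetical names, and the numbers in `TOY_SCORES` are made-up placeholders, not the published scores. A real analysis would load the published five-score table for all 20 amino acids.

```python
from typing import Dict, List, Sequence

# Placeholder table for three residues only. The values are INVENTED for
# illustration and are NOT the published scores. Each tuple stands for five
# scores (polarity, secondary structure, molecular volume, codon diversity,
# electrostatic charge).
TOY_SCORES: Dict[str, Sequence[float]] = {
    "A": (0.1, 0.2, 0.3, 0.4, 0.5),
    "R": (1.0, 0.0, -1.0, 0.5, -0.5),
    "K": (-0.2, 0.4, 0.0, 0.0, 1.5),
}


def encode_sequence(
    seq: str, scores: Dict[str, Sequence[float]]
) -> List[List[float]]:
    """Replace each residue letter with its vector of numeric scores.

    Returns one row per sequence position, so an alphabetic sequence of
    length n becomes an n x 5 numeric matrix.
    """
    encoded = []
    for position, residue in enumerate(seq.upper()):
        if residue not in scores:
            raise KeyError(
                f"no scores for residue {residue!r} at position {position}"
            )
        encoded.append(list(scores[residue]))
    return encoded
```

Once every sequence in an alignment is encoded this way, each alignment column yields ordinary numeric variables that standard methods such as analysis of variance or discriminant analysis can consume.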

Molecular Architecture of the DNA-Binding Region and Its Relationship to Classification of Basic Helix-Loop-Helix Proteins

Atchley

Zhao

2006

Molecular Biology and Evolution

Multivariate statistical analyses are used to explore the molecular architecture of the DNA-binding and dimerization regions of basic helix-loop-helix (bHLH) proteins. Alphabetic amino acid data are transformed to biologically meaningful quantitative values using a set of 5 multivariate "indices." These multivariate indices summarize variation in a large suite of amino acid physiochemical attributes and reflect variability in polarity-accessibility-hydrophobicity, propensity for secondary structure, molecular size, codon composition, and electrostatic charge. Using these index score data, discriminant analyses describe the multidimensional aspects of physiochemical variation and clarify the structural basis of the prevailing evolutionary classification of bHLH proteins. A small number of amino acids from both the binding and dimerization domains, when considered simultaneously, accurately distinguish the 5 known DNA-binding groups. The relevant sites often have well-documented structural and functional characteristics.

A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets

Sharma

Podolsky

Zhao

et al. 2009

Motivation: As the number of publicly available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical. There is a need for algorithms that are fast and can cluster extremely large datasets without affecting cluster quality. Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high input/output costs involved and the large distance matrices calculated, most agglomerative clustering algorithms fail on large datasets (30 000+ genes/200+ arrays). In this article, we propose a new two-stage algorithm which partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the Balanced Iterative Reducing and Clustering using Hierarchies algorithm, with the second stage being a conventional k-means clustering technique. This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. We compared the clustering results using the two-stage hyperplane algorithm with the conventional k-means algorithm from other available programs. Because the first stage traverses the data in a single scan, performance and speed increase substantially. The data reduction accomplished in the first stage of the algorithm reduces the memory requirements, allowing us to cluster 44 460 genes without failure, and significantly decreases the time to complete when compared with popular k-means programs. The software was written in C# (.NET 1.1).

Availability: The program is freely available and can be downloaded from http://www.amdcc.org/bioinformatics/bioinformatics.aspx.

Contact: rmcindoe@mail.mcg.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
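The two-stage idea in this abstract (a single-scan data reduction followed by k-means on the reduced data) can be sketched roughly as below. This is not the HPCluster implementation, which was written in C#. As a labeled simplification, stage 1 here keeps a flat list of (count, linear sum) summaries instead of a BIRCH tree, and the hyperplane partitioning is not modeled. The names `summarize` and `weighted_kmeans`, the `threshold` parameter, and the naive initialization are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def summarize(points: Sequence[Sequence[float]], threshold: float) -> List[list]:
    """Stage 1: one pass over the data, folding each point into a summary.

    Each summary is [count, linear_sum]. A point joins the nearest summary
    whose centroid lies within `threshold`, otherwise it starts a new one.
    """
    summaries: List[list] = []
    for p in points:
        best, best_dist = None, threshold
        for s in summaries:
            centroid = [v / s[0] for v in s[1]]
            d = math.dist(p, centroid)
            if d <= best_dist:
                best, best_dist = s, d
        if best is None:
            summaries.append([1, list(p)])
        else:
            best[0] += 1
            best[1] = [a + b for a, b in zip(best[1], p)]
    return summaries


def weighted_kmeans(
    summaries: List[list], k: int, iterations: int = 20
) -> Tuple[List[List[float]], List[int]]:
    """Stage 2: Lloyd's k-means on summary centroids, weighted by count."""
    centroids = [[v / n for v in ls] for n, ls in summaries]
    weights = [n for n, _ in summaries]
    if k > len(centroids):
        raise ValueError("k exceeds the number of summaries")
    # Naive deterministic initialization: the first k summary centroids.
    centers = [list(c) for c in centroids[:k]]
    labels = [0] * len(centroids)
    for _ in range(iterations):
        # Assign each summary to its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(c, centers[j]))
            for c in centroids
        ]
        # Move each center to the weighted mean of its assigned summaries.
        for j in range(k):
            members = [i for i, lab in enumerate(labels) if lab == j]
            total = sum(weights[i] for i in members)
            if total:
                centers[j] = [
                    sum(weights[i] * centroids[i][d] for i in members) / total
                    for d in range(len(centers[j]))
                ]
    return centers, labels
```

Because stage 2 only sees the summaries, its memory and time costs depend on the number of summaries rather than the number of original rows, which is the kind of reduction the abstract credits for handling very large datasets.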

Proton pump inhibitor-induced risk of chronic kidney disease is associated with increase of indoxyl sulfate synthesis via inhibition of CYP2E1 protein degradation

Zhao

Chen

et al. 2022

Chemico-Biological Interactions

ParaSAM: a parallelized version of the significance analysis of microarrays algorithm

Sharma

Zhao

Podolsky

et al. 2010

Motivation: Significance analysis of microarrays (SAM) is a widely used permutation-based approach to identifying differentially expressed genes in microarray datasets. While SAM is freely available as an Excel plug-in and as an R package, analyses of large datasets are often limited by very high memory requirements.

Summary: We have developed a parallelized version of the SAM algorithm called ParaSAM to overcome the memory limitations. This high-performance multithreaded application provides the scientific community with an easy and manageable client-server Windows application with a graphical user interface and does not require programming experience to run. The parallel nature of the application comes from the use of web services to perform the permutations. Our results indicate that ParaSAM is not only faster than the serial version, but can also analyze extremely large datasets that cannot be analyzed using existing implementations.

Availability: A web version open to the public is available at http://bioanalysis.genomics.mcg.edu/parasam. For local installations, both the Windows and web implementations of ParaSAM are available for free at http://www.amdcc.org/bioinformatics/software/parasam.aspx

Contact: rmcindoe@mail.mcg.edu

Supplementary information: Supplementary Data is available at Bioinformatics online.
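The permutation step is what makes this kind of analysis easy to parallelize: each label permutation is independent of the others. The sketch below illustrates that structure only and is not ParaSAM's code. ParaSAM distributes permutations through web services, whereas this example swaps in a local thread pool. The function names, the `s0` default, and the seeding scheme are assumptions, and the example assumes two groups coded 0 and 1. In CPython, threads show the structure but would not speed up this CPU-bound work; a process pool or separate machines would be needed for that.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def relative_difference(
    values: Sequence[float], labels: Sequence[int], s0: float = 0.1
) -> float:
    """SAM-style statistic for one gene: mean difference over (spread + s0)."""
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    # Pooled sum of squared deviations across both groups.
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    s = math.sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
    return (mean_b - mean_a) / (s + s0)


def permuted_statistics(
    expr: Sequence[Sequence[float]], labels: Sequence[int], seed: int
) -> List[float]:
    """Shuffle the sample labels once and return the sorted gene statistics."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)  # per-permutation seed: reproducible
    return sorted(relative_difference(gene, shuffled) for gene in expr)


def parallel_null(
    expr: Sequence[Sequence[float]],
    labels: Sequence[int],
    n_permutations: int,
    workers: int = 4,
) -> List[List[float]]:
    """Run independent permutations concurrently; results keep seed order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(
                lambda seed: permuted_statistics(expr, labels, seed),
                range(n_permutations),
            )
        )
```

Giving each permutation its own seed keeps the parallel result identical to a serial run, which makes it straightforward to check the parallel version against a reference.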
