Reducing the amino acid alphabet can significantly simplify protein complexity, which can improve computational efficiency, decrease information redundancy and reduce the chance of overfitting. Although some reduced alphabets have been proposed, different classification rules can produce distinct results in protein sequence analysis. Thus, a systematic framework for reduced alphabets is urgently needed. In this work, we constructed a comprehensive web server called RAACBook for protein sequence analysis and machine learning applications by integrating reduced alphabets. The web server contains three parts: (i) 74 types of reduced amino acid alphabet were manually extracted to generate 673 reduced amino acid clusters (RAACs) for addressing specific protein problems. Users can easily select the desired RAACs with a multilayer browser tool. (ii) An online tool was developed to analyze the primary sequence of proteins. The tool produces K-tuple reduced amino acid composition defined by three correlation parameters (K-tuple, g-gap, λ-correlation). The results are visualized as a sequence alignment, merged RAA composition, feature distribution and a logo of the reduced sequence. (iii) A machine learning server is provided to train protein classification models based on K-tuple RAAC. The optimal model can be selected according to the evaluation indexes (ROC, AUC, MCC, etc.). In conclusion, RAACBook presents a powerful and user-friendly service for protein sequence analysis and computational proteomics. RAACBook is freely available at http://bioinfor.imu.edu.cn/raacbook. Database URL: http://bioinfor.imu.edu.cn/raacbook
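To make the K-tuple reduced amino acid composition concrete, the following is a minimal Python sketch, not RAACBook's actual implementation. The five-cluster alphabet `EXAMPLE_CLUSTERS` and the function names are hypothetical, and the g-gap is assumed to mean "skip g residues between consecutive positions of a tuple". The λ-correlation parameter is left out because the abstract does not define it.

```python
from collections import Counter
from itertools import product

# Hypothetical 5-cluster reduction of the 20 amino acids (illustration only,
# not one of the 673 RAACs from the server).
EXAMPLE_CLUSTERS = ["AVLIMC", "FWYH", "STNQ", "KR", "DEGP"]


def build_mapping(clusters):
    """Map every residue to the first letter of its cluster."""
    return {aa: cluster[0] for cluster in clusters for aa in cluster}


def reduce_sequence(seq, mapping):
    """Rewrite a protein sequence in the reduced alphabet.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    return "".join(mapping[aa] for aa in seq)


def ktuple_composition(reduced, alphabet, k=2, gap=0):
    """Frequency of every k-tuple whose positions are `gap` residues apart.

    Assumption: gap=0 gives contiguous tuples; gap=g skips g residues
    between consecutive tuple positions.
    """
    step = gap + 1
    span = (k - 1) * step + 1  # sequence length covered by one tuple
    counts = Counter(
        reduced[i:i + span:step] for i in range(len(reduced) - span + 1)
    )
    total = sum(counts.values())
    # Report all possible tuples so feature vectors have a fixed length.
    return {
        "".join(t): (counts["".join(t)] / total if total else 0.0)
        for t in product(alphabet, repeat=k)
    }


mapping = build_mapping(EXAMPLE_CLUSTERS)
alphabet = "".join(cluster[0] for cluster in EXAMPLE_CLUSTERS)
reduced = reduce_sequence("MKVLDE", mapping)
features = ktuple_composition(reduced, alphabet, k=2, gap=1)
print(reduced, len(features))
```

With five clusters and K = 2 the feature vector has 25 entries instead of the 400 needed for the full alphabet, which is the dimensionality reduction the abstract refers to.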
Sequence logos give a fast and concise visualization of consensus sequences. Proteins exhibit greater complexity and diversity than DNA, which usually affects the graphical representation of the logo. Reduced amino acids are a powerful means of simplifying the complexity of sequence alignments, which motivated us to establish RaacLogo. As a new sequence logo generator based on reduced amino acid alphabets, RaacLogo can easily generate many different simplified logos tailored to users, who select among various reduced amino acid alphabets derived from more than 40 clustering algorithms. The current web server provides 74 types of reduced amino acid alphabet, which were manually extracted to generate 673 reduced amino acid clusters (RAACs) for dealing with protein alignment. A two-dimensional selector is provided for easily selecting the desired RAACs using underlying biological knowledge. It is anticipated that the RaacLogo web server will play an important role in protein sequence alignment, topological estimation and protein design experiments. RaacLogo is freely available at http://bioinfor.imu.edu.cn/raaclogo.
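As an illustration of how a reduced alphabet simplifies a logo, the sketch below computes the standard information-content letter heights (log2 of the alphabet size minus the column's Shannon entropy, split by symbol frequency) after mapping an alignment to a reduced alphabet. This is a generic logo calculation, not necessarily the one RaacLogo uses. The cluster definition is hypothetical, the alignment is assumed to be ungapped, and the small-sample correction is omitted.

```python
import math
from collections import Counter

# Hypothetical 5-cluster reduction (illustration only).
CLUSTERS = ["AVLIMC", "FWYH", "STNQ", "KR", "DEGP"]
MAPPING = {aa: cluster[0] for cluster in CLUSTERS for aa in cluster}


def column_information(column, alphabet_size):
    """Information content of one column in bits: log2(N) - entropy."""
    n = len(column)
    entropy = -sum(
        (c / n) * math.log2(c / n) for c in Counter(column).values()
    )
    return math.log2(alphabet_size) - entropy


def logo_heights(alignment, mapping, alphabet_size):
    """Per-column letter heights for an ungapped alignment.

    Each symbol's height is its frequency times the column information.
    """
    reduced = ["".join(mapping[aa] for aa in seq) for seq in alignment]
    heights = []
    for column in zip(*reduced):
        info = column_information(column, alphabet_size)
        n = len(column)
        heights.append(
            {sym: (c / n) * info for sym, c in Counter(column).items()}
        )
    return heights


alignment = ["MK", "LR", "VD", "AE"]
for position, column in enumerate(logo_heights(alignment, MAPPING, 5), 1):
    print(position, column)
```

In this toy alignment the first column contains four different residues, yet all of them fall into one cluster, so the reduced logo shows a single fully conserved symbol at that position.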
Defensins, one of the major classes of host defense peptides, play a significant role in innate immunity and have evolved extensively in almost all living organisms. Accurate high-throughput computational methods can help in designing drugs or medical interventions to defend against pathogens. To take up this challenge, an up-to-date server based on a rigorous benchmark dataset, referred to as iDEF-PseRAAC, was designed in this study for predicting the defensin family. By extracting primary sequence compositions based on different types of reduced amino acid alphabet, the selected feature subset achieved a best overall accuracy of 92.38%. We therefore conclude that the information provided by abundant types of amino acid reduction offers an efficient and rational methodology for defensin identification. A free online server is available for academic users at http://bioinfor.imu.edu.cn/idpf. We expect that iDEF-PseRAAC will be a promising tool for the functional annotation of defensin proteins.
Understanding early development offers a striking opportunity to investigate genetic disease, stem cells and assisted reproductive technology. Recent advances in high-throughput sequencing technology have led to a rising influx of omics data, which has rapidly boosted our understanding of mammalian developmental mechanisms. Here, we review the database EmExplorer (a database for exploring time activation of gene expression in mammalian embryos), which systematically organizes the genes from development-related pathways and which we have established and continue to update. The current version of EmExplorer incorporates over 26 000 genes obtained from 306 functional pathways in five species. The function annotations of development-related genes were also integrated into EmExplorer. To facilitate data extraction, the database also provides the following. (i) The dynamic expression values for each development stage are matched to the corresponding genes. (ii) A two-layer search tool supports multi-option searching, such as by official symbol, pathway name and function annotation. The returned entries link directly to the analysis results for the corresponding gene or pathway in the analysis module. (iii) The analysis module provides gene comparisons at the multi-species level and the functional pathway level, showing species specificity and stage specificity at the gene or pathway level. (iv) An analysis based on the hypergeometric distribution test reveals the enrichment of gene functions at a particular stage of one organism's pathway. (v) The browser is designed for users without a specific search goal and helps new users get a general idea of the contents of the database. (vi) The experimentally validated pathways are manually curated and shown on the home page. EmExplorer will be helpful for elucidating early developmental mechanisms and exploring time activation genes. EmExplorer is freely available at http://bioinfor.imu.edu.cn/emexplorer.
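The hypergeometric enrichment test mentioned in item (iv) of the EmExplorer abstract can be sketched as follows. This is a generic one-sided test written with the Python standard library, not EmExplorer's own code. The function and parameter names are hypothetical, and how the database defines the background gene set is not stated in the abstract.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, overlap):
    """One-sided p-value P(X >= overlap) under the hypergeometric model.

    population: number of background genes
    annotated:  background genes carrying the function of interest
    sample:     genes in the stage-specific set being tested
    overlap:    genes in the sample that carry the function
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie within the population")
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability of every outcome at least as extreme as `overlap`.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(max(overlap, 0), upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 4 annotated, 3 sampled, 2 overlapping.
print(hypergeom_enrichment_p(10, 4, 3, 2))
```

A small p-value indicates that the function is over-represented in the tested gene set relative to the background. When many functions are tested at once, a multiple-testing correction would normally be applied on top of this.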
Human preimplantation development is a complex process involving dramatic changes in transcriptional architecture. To better understand its spatiotemporal progression, it is indispensable to identify key genes. Although single-cell RNA sequencing (RNA-seq) techniques can provide detailed clustering signatures, the identification of decisive factors remains difficult. Additionally, such experiments are costly and time-consuming. Thus, computational methods for identifying effective genes of developmental signatures are highly desirable. In this study, we developed a predictor called EmPredictor to identify developmental stages of human preimplantation embryogenesis. First, we compared the F-score feature selection algorithm with differential gene expression (DGE) analysis to find specific signatures of each developmental stage. In addition, by training a support vector machine (SVM), four types of signature subsets were comprehensively discussed. The prediction results showed that a feature subset with 1,881 genes from the F-score algorithm obtained the best predictive performance, achieving the highest accuracy of 93.3% on the cross-validation set. Further function enrichment demonstrated that the gene set selected by the feature selection method was involved in more development-related pathways and cell fate determination biomarkers. This indicates that the F-score algorithm should be preferred for detecting key genes in multi-period data from mammalian early development.
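For readers unfamiliar with F-score feature selection, here is a minimal standard-library sketch. It assumes the commonly used definition (squared distance of each class mean from the overall mean, divided by the summed within-class sample variances), extended from two classes to several developmental stages. The abstract does not say which variant EmPredictor uses, so the formula, function names and toy data are all assumptions.

```python
from statistics import mean, variance


def f_score(values, labels):
    """F-score of one feature (e.g. one gene) across labelled samples.

    Assumed definition: sum over classes of (class mean - overall mean)^2,
    divided by the sum of within-class sample variances.
    Each class needs at least two samples.
    """
    overall = mean(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    between = sum((mean(g) - overall) ** 2 for g in groups.values())
    within = sum(variance(g) for g in groups.values())
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(expression, labels):
    """Return feature names sorted from most to least discriminative."""
    scores = {name: f_score(vals, labels) for name, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy expression values for two hypothetical genes over six samples.
labels = ["2-cell"] * 3 + ["4-cell"] * 3
expression = {
    "geneA": [1, 2, 3, 7, 8, 9],  # separates the two stages well
    "geneB": [5, 5, 6, 5, 6, 5],  # nearly identical in both stages
}
print(rank_features(expression, labels))
```

The top-ranked genes would then be passed to a classifier such as an SVM, with the subset size chosen by cross-validation.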
Terminally differentiated somatic cells can be reprogrammed into a totipotent state through somatic cell nuclear transfer (SCNT). Incomplete reprogramming is the major reason for developmental arrest of SCNT embryos at early stages. In our studies, we found that pathways for autophagy, endocytosis, and apoptosis were incompletely activated in nuclear transfer (NT) 2-cell arrest embryos, whereas extensively inhibited pathways for stem cell pluripotency maintenance, DNA repair, cell cycle, and autophagy may result in the arrest of NT 4-cell embryos. In NT normal embryos, a significant shift in the expression of the developmental transcription factors (TFs) Id1, Pou6f1, Cited1, and Zscan4c was observed. Whereas the pluripotency gene Ascl2 was activated only in NT 2-cell embryos, Nanog, Dppa2, and Sall4 had major expression waves in the normal development of both NT 2-cell and 4-cell embryos. Additionally, Kdm4b/4d and Kdm5b were confirmed as key markers in NT 2-cell and 4-cell embryos, respectively. Histone acetylases Kat8, Elp6, and Eid1 were co-activated in NT 2-cell and 4-cell embryos to facilitate normal development. Gadd45a, as a key driver, functions with Tet1 and Tet2 to improve the efficiency of NT reprogramming. Taken together, our findings provide an important theoretical basis for elucidating the potential molecular mechanisms and identify reprogramming driver factors that could improve the efficiency of SCNT reprogramming.
The emerging importance of embryonic development research is rapidly increasing the demand for a professional resource for multi-omics data. However, the lack of a global embryogenesis repository and systematic analysis tools limits progress in stem cell research, human congenital diseases and assisted reproduction. Here, we developed EmAtlas, which collects the most comprehensive multi-omics data and provides multi-scale tools to explore spatiotemporal activation during mammalian embryogenesis. EmAtlas contains data on multiple types of gene expression, chromatin accessibility, DNA methylation, nucleosome occupancy, histone modifications, and transcription factors, displaying the complete spatiotemporal landscape in mouse and human across several time points, including gametogenesis, preimplantation, and even the fetal and neonatal stages, with each tissue involving various cell types. To characterize signatures at the tissue, cell, genome, gene and protein levels during mammalian embryogenesis, analysis tools on these five scales were developed. Additionally, we proposed EmRanger to deliver extensive development-related biological background annotations. Users can analyze, browse, visualize, and download data with these tools through a user-friendly interface. EmAtlas is freely accessible at http://bioinfor.imu.edu.cn/ematlas.
Background: DNA methylation plays an important role in the reprogramming process. Understanding the underlying molecular mechanism of reprogramming is crucial for answering fundamental questions regarding the transition of cell identity. Method: In this study, based on genome-wide DNA methylation data from different cell lines, comparative methylation profiles were used to identify epigenetic signatures of cell reprogramming. Results: The density profile of CpG methylation showed that pluripotent cells are more polarized than human dermal fibroblast (HDF) cells. The heterogeneity of iPS cells shows a greater deviation in the DNA hypermethylation pattern. The regional distribution showed that the differential CpG sites between pluripotent cells and HDFs tend to accumulate in gene body and CpG shelf regions, whereas the internal differentially methylated CpG sites (DMCs) of the three types of pluripotent cells tend to accumulate in the TSS1500 region. Furthermore, a series of endogenous markers of cell reprogramming were identified based on integrative analysis, including focal adhesion, pluripotency maintenance and transcription regulation. The calcium signaling pathway was detected as one of the signatures distinguishing NT cells from iPS cells. Finally, the regional bias of DNA methylation for key pluripotency factors was discussed. Our studies provide new insight into the identification of barriers to cell reprogramming. Conclusion: Taken together, our studies have analyzed some epigenetic markers and barriers of nuclear reprogramming, in the hope of providing new insight into the underlying molecular mechanism of reprogramming.