One of the major challenges in bioinformatics is the development of efficient computational algorithms for biological sequence motif discovery. In the post-genomic era, the ability to predict the behavior, function, or structure of biological entities or motifs such as genes and proteins, as well as the interactions among them, plays a fundamental role in discovering information that helps explain biological mechanisms. This has necessitated the development of computational methods for identifying these entities. Consequently, a large number of motif finding algorithms have been implemented and applied to various organisms over the past decade. This paper presents a comparative analysis of the latest developments in motif finding algorithms and proposes an algorithm for motif discovery that combines pattern-driven and statistical approaches. The proposed algorithm, Suffix Tree Gene Enrichment Motif Searching (STGEMS), as reported in [30], proved effective in identifying motifs from organisms with peculiarities in their genomic structure, such as the AT-rich sequence of the malaria parasite, P. falciparum. An empirical time analysis of seven motif discovery algorithms was carried out using four sets of genes from the intraerythrocytic development cycle of P. falciparum. The results show that algorithms based on a combinatorial approach are more desirable.
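To make the pattern-driven side of such an approach concrete, the following is a minimal sketch, not the STGEMS algorithm itself (which uses a suffix tree and gene enrichment, neither of which is reproduced here). It enumerates every k-mer and keeps those that occur in at least a given number of input sequences. The function name `kmers_by_quorum`, the `quorum` parameter, and the toy sequences are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def kmers_by_quorum(sequences, k, quorum):
    """Return {k-mer: number of sequences containing it} for every k-mer
    that occurs in at least `quorum` of the input sequences."""
    support = defaultdict(set)
    for idx, seq in enumerate(sequences):
        seq = seq.upper()
        # Slide a window of width k over the sequence.
        for start in range(len(seq) - k + 1):
            support[seq[start:start + k]].add(idx)
    return {kmer: len(ids) for kmer, ids in support.items() if len(ids) >= quorum}


# Hypothetical AT-rich toy sequences (not data from the paper).
toy_sequences = [
    "ATATTGCACATAT",
    "TTTGCACAATATA",
    "AATATTGCACATT",
]

# Candidate motifs: 6-mers present in all three sequences.
candidates = kmers_by_quorum(toy_sequences, k=6, quorum=3)
print(sorted(candidates))
```

A statistical stage would then score each surviving candidate against a background model. In AT-rich genomes that step matters, because short A/T-only words pass a plain quorum filter by chance.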
The Coronavirus Disease 2019 (COVID-19) pandemic has catalyzed expectations for technology-enhanced interactions with personalized educational materials. Adjusting the content of educational materials to the geographical location of a learner is a customization feature of personalized education and is used to develop the learner's interest in the content. The educational content of interest in this report is bioinformatics, in which the knowledge spans biological science and applied mathematics disciplines. The Human Heredity and Health in Africa (H3Africa) Initiative is a suitable resource for obtaining data and peer-reviewed scholarly articles that are geographically relevant and focus on authentic problem solving in the human health domain. We developed a computerized platform of interactive visual representations of curated bioinformatics datasets from H3Africa projects, which also supports the customization, individualization and adaptation features of personalized education. We obtained evidence for the positive effect size and acceptable usability of a visual analytics resource designed for the retrieval-based learning of facts on the functional impacts of genomic sequence variants. We conclude that technology-enhanced personalized bioinformatics educational interventions have implications for (1) the meaningful learning of bioinformatics; (2) stimulating additional student interest in bioinformatics; and (3) improving the accessibility of bioinformatics education to non-bioinformaticians.
The security of any system is a key factor in its acceptance by the general public. We propose an intuitive approach to fraud detection in financial institutions using machine learning by designing a Hybrid Credit Card Fraud Detection (HCCFD) system, which uses anomaly detection, applying a genetic algorithm and the multivariate normal distribution, to identify fraudulent transactions on credit cards. An imbalanced dataset of credit card transactions, with a target variable indicating whether a transaction is fraudulent or not, was used to train the HCCFD. Using the F-score as the performance metric, the model was tested and gave a prediction accuracy of 93.5%, as against an artificial neural network, a decision tree and a support vector machine, which scored 84.2%, 80.0% and 68.5% respectively when trained on the same dataset. The results showed a significant improvement over these other widely used algorithms.
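The anomaly-detection idea can be illustrated with a minimal sketch, under simplifying assumptions that are not taken from the paper. The multivariate normal is given a diagonal covariance (features treated as independent). The density cut-off is chosen by a plain scan over candidate values rather than by a genetic algorithm. All function names and the toy transactions are hypothetical.

```python
import math


def fit_gaussian(rows):
    """Per-feature mean and variance (diagonal-covariance assumption)."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    variances = [
        max(sum((r[j] - means[j]) ** 2 for r in rows) / n, 1e-9)
        for j in range(dims)
    ]
    return means, variances


def log_density(row, means, variances):
    """Log of the multivariate normal density with diagonal covariance."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for x, m, v in zip(row, means, variances)
    )


def f_score(predicted, actual):
    """F1 score: harmonic mean of precision and recall."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try each observed score as a cut-off; keep the one with the best F-score.
    (A simple scan, standing in for the paper's genetic algorithm.)"""
    best = (0.0, None)
    for eps in sorted(set(scores)):
        predicted = [s <= eps for s in scores]  # low density => flagged as fraud
        score = f_score(predicted, labels)
        if score > best[0]:
            best = (score, eps)
    return best


# Hypothetical toy data: (amount, transactions per hour).
normal = [(48.0, 1.0), (52.0, 1.2), (50.0, 0.8), (49.0, 1.1), (51.0, 0.9)]
validation = [(50.5, 1.0), (49.5, 1.1), (900.0, 9.0), (850.0, 8.5)]
labels = [False, False, True, True]

means, variances = fit_gaussian(normal)
scores = [log_density(row, means, variances) for row in validation]
best_f, epsilon = best_threshold(scores, labels)
print(best_f, epsilon)
```

Fitting the density on legitimate transactions only, and using the few labeled frauds just to pick the cut-off, is one common way to cope with class imbalance. Scoring the cut-off by F-score rather than raw accuracy keeps a model that flags nothing from looking good.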
Understanding the interrelationships among genes in a cellular system is fundamental to the investigation of cellular activities, because interrelated genes are functionally related, controlled by the same transcriptional regulatory process, or take part in a common biological process, and most importantly are known to be co-expressed. Most latent Mtb genes have been discovered, but their functions, interrelationships and correlations, which would help in developing protocol(s) to tame the menace of tuberculosis at latency, have not been fully uncovered. We have developed a computational technique called the Fuzzified Adjusted Rand Index (FARI) to effectively discover co-expressed genes among identified latent Mtb genes and to perform functional analysis of the gene sets using an annotation database. FARI, a modification of the Adjusted Rand Index used to compare clustering results, is designed to analyze, establish and quantify the expression trends of two genes across different sample points. A rank matrix of all the genes under consideration is produced after each gene has been analyzed against the others, and this rank matrix serves as the basis for co-expression discovery. A synthetic gene expression dataset, a biological benchmark dataset (E. coli), and different sets of genes containing latent Mtb genes from an experimental result were fed into the computational tool, and different gene sets (modules) representing co-expressed genes were discovered. The gene modules discovered from latent Mtb genes are used to uncover the hub genes and their molecular functions. We have been able to identify different co-expression networks from this analysis and assign biological functional meanings to some of the important Mtb genes that emerged from the experiment. In addition, discovering gene co-expression modules yields a gene co-expression network, which is a preliminary step towards gene regulatory network discovery.
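For reference, the standard Adjusted Rand Index that FARI modifies can be computed as in the sketch below. This is the classical index for comparing two partitions, not FARI itself, whose fuzzification step is not described in this abstract and is therefore not reproduced. The function name and the example labelings are illustrative assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Classical Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Contingency table: how many items fall in cluster i of A and cluster j of B.
    contingency = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


# Same grouping under different label names: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Partially matching groupings give a value between 0 and 1.
print(adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1]))
```

The chance correction is what distinguishes this index from the plain Rand index: unrelated partitions score near 0, and scores can be negative when agreement is worse than chance.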
Clustering is one of the fundamental processes in analyzing gene expression data, essentially by comparing gene expression profiles or sample expression profiles. Comparing expression profiles requires a measure, separate from the clustering algorithm itself, to quantify how similar or dissimilar the objects under consideration are. Various clustering algorithms have been used to analyze gene expression data. Some of these algorithms reported incorporating similarity measures such as Euclidean distance, Pearson correlation and mutual information to improve their performance. This work considered different reported clustering algorithms for gene expression data analysis and the importance of different similarity measures in optimizing them. In all the works investigated, no clustering technique was applied directly to gene expression data. It is observed that the output (distance matrix) of a similarity or dissimilarity measure serves as the input to the clustering technique. Those works that did not use any of the popular proximity measures applied one or two approaches, such as Constrained Coherency (CoCo), Silhouette coefficient measurement, and normalization and discretization, to refine gene expression data for improved cluster quality by speeding up the learning phase, reducing computational space and handling noise effectively.
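The observation that a distance matrix, rather than the raw expression data, is what the clustering step consumes can be sketched as follows. This is an illustrative example under stated assumptions, not a method from any of the surveyed works: dissimilarity is taken as 1 minus the Pearson correlation, clustering is a naive single-linkage agglomeration, and the toy profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant profile: correlation undefined, treated as 0 here
    return cov / (sx * sy)


def distance_matrix(profiles):
    """Pairwise dissimilarity 1 - r between expression profiles."""
    n = len(profiles)
    return [
        [1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
        for i in range(n)
    ]


def single_linkage(dist, n_clusters):
    """Agglomerative clustering that reads only the distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical expression profiles: genes 0 and 1 rise, genes 2 and 3 fall.
toy_profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.5, 4.0, 2.0],
]

modules = single_linkage(distance_matrix(toy_profiles), n_clusters=2)
print(modules)
```

Because `single_linkage` never touches the profiles, swapping Pearson for Euclidean distance or mutual information only changes `distance_matrix`, which is the separation of concerns the text describes.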
Background: The computational reconstruction of Gene Regulatory Networks (GRNs) using different techniques has encountered the challenge of constructing large networks, because of the many parameters to be fitted and the nature of the input data. In fact, contemporary works on GRN inference that use hybridized techniques, especially Artificial Neural Networks (ANNs) with meta-heuristic optimization techniques, have to trade off computational cost against accuracy in reconstructing large-scale GRNs. This work designed an efficient feature selection algorithm with a GRN model to overcome the dimensionality problem of the input data using biological prior knowledge of co-expression and network sparseness, so as to capture and represent the actual interrelationships among genes. Methodology: The GRN model is an ensemble Multi-Layer Perceptron (MLP) network incorporating a novel feature selection algorithm termed the Fuzzified Adjusted Rand Index (FARI). FARI is developed to investigate and establish the expression trends of genes in expression profile data. A rank matrix of all genes produced by FARI shows their co-expression relationships, which are used to coordinate the selection of potential predictors as input features to the inference model. Each target gene is modeled separately, with its parameters updated independently, as one of several subproblems of the overall network. The model's performance was evaluated on synthetic, E. coli and Mtb data. Result: The results indicated improved accuracy in the construction of large-scale GRNs, together with a significant speedup. The result on Mtb identified CCL5 as the first expressed gene, which is consistent with CCL1 identified by the experimental method. Some of the expressed genes were validated through their biological pathways, showing immune responses and host susceptibility to TB.
Conclusion: Including prior biological knowledge in the MLP model enabled the construction of an accurate large-scale GRN by reducing the potentially large search space of GRN modeling. In addition, the model produced two major biological networks from the same process using the same dataset, for appropriate biological validation.