The sequence database searching method is widely used in proteomics
for peptide identification. To control the false discovery rate (FDR)
of the search results, the target–decoy method generates
and searches a decoy database together with the target database. A
known problem is that the target protein sequence database may contain
numerous repeated peptides. The structures of these repeats are not
preserved by most existing decoy generation algorithms. Previous studies
suggest that such discrepancy between the target and decoy databases
may lead to an inaccurate FDR estimation. Based on the de Bruijn graph
model, we propose a new repeat-preserving algorithm to generate decoy
databases. We prove that this algorithm preserves the structures of
the repeats in the target database to a great extent. The de Bruijn
method was compared with several other commonly used methods and
demonstrated superior FDR estimation accuracy and an improved number
of peptide identifications.
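The target–decoy idea described above can be illustrated with a minimal sketch. It assumes one common estimator for a concatenated target and decoy search, namely the number of decoy matches divided by the number of target matches at or above a score threshold. The function name `estimate_fdr`, the `(score, is_decoy)` tuple layout, and the example scores are hypothetical and are not taken from the paper.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR among matches scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs (an assumed layout).
    Uses the simple decoys / targets estimator; other variants exist.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        # No accepted target matches, so there is nothing to estimate.
        return 0.0
    return decoys / targets


# Hypothetical peptide-spectrum matches: (score, is_decoy)
psms = [
    (9.1, False), (8.7, False), (8.2, True), (7.9, False),
    (7.5, False), (6.0, True), (5.5, False),
]

fdr = estimate_fdr(psms, 7.0)
print(fdr)  # 4 targets and 1 decoy pass the threshold, so 0.25
```

The estimate is only as good as the decoy database's resemblance to the target database, which is why the handling of repeated peptides matters here.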
Background: Developing biologically relevant models from gene expression data, notably microarray data, has become a topic of great interest in bioinformatics, clinical genetics, and oncology. Only a small fraction of the genes explored show expression that correlates significantly with a given phenotype. Gene selection gives researchers substantial insight into the genetic nature of a disease and the mechanisms responsible for it. Besides improving the performance of cancer classification, it can also cut the time and cost of medical diagnosis.
Methods: This study presents a modified Artificial Bee Colony (ABC) algorithm that selects a minimum number of genes deemed significant for cancer while improving predictive accuracy. The search equation of ABC is believed to be good at exploration but poor at exploitation. To overcome this limitation, we modify the ABC algorithm by incorporating pheromones, one of the major components of the Ant Colony Optimization (ACO) algorithm, together with a new operation in which successive bees communicate to share their findings.
Results: The proposed algorithm is evaluated on a suite of ten publicly available datasets after its parameters are tuned systematically on one of the datasets. The results are compared with other works that used the same datasets, and the proposed method is found to be superior.
Conclusion: The method presented in this paper can provide a subset of genes that leads to more accurate classification results with fewer selected genes. Additionally, the proposed modified Artificial Bee Colony algorithm could conceivably be applied to problems in other areas as well.
Electronic supplementary material: The online version of this article (doi:10.1186/s12920-016-0204-7) contains supplementary material, which is available to authorized users.
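For readers unfamiliar with the ABC search equation mentioned in the Methods, the sketch below shows the standard continuous form, v_ij = x_ij + phi * (x_ij - x_kj), with phi drawn uniformly from [-1, 1]. It is the textbook ABC update, not the modified algorithm of this paper: the pheromone term and the bee communication step are not reproduced, and the function name `abc_candidate` and the toy population are hypothetical.

```python
import random


def abc_candidate(population, i, rng):
    """Generate a neighbour of food source i with the standard ABC update.

    One randomly chosen dimension j is perturbed relative to a randomly
    chosen partner k != i; all other dimensions are copied unchanged.
    """
    x_i = population[i]
    k = rng.choice([idx for idx in range(len(population)) if idx != i])
    j = rng.randrange(len(x_i))
    phi = rng.uniform(-1.0, 1.0)
    candidate = list(x_i)
    candidate[j] = x_i[j] + phi * (x_i[j] - population[k][j])
    return candidate


rng = random.Random(0)  # fixed seed so the run is repeatable
population = [[0.0, 1.0, 2.0], [1.0, 1.5, 0.5], [2.0, 0.0, 1.0]]
candidate = abc_candidate(population, 0, rng)
print(candidate)
```

Because the partner is picked at random rather than guided toward good solutions, the update favours exploration, which is consistent with the limitation the abstract describes.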