The CMVE software is available upon request from the authors.
Background: The use of ontologies to control vocabulary and structure annotation has added value to genome-scale data, and contributed to the capture and re-use of knowledge across research domains. Gene Ontology (GO) is widely used to capture detailed expert knowledge in genomic-scale datasets and as a consequence has grown to contain many terms, making it unwieldy for many applications. To increase its ease of manipulation and efficiency of use, subsets called GO slims are often created by collapsing terms upward into more general, high-level terms relevant to a particular context. Creation of a GO slim currently requires manipulation and editing of GO by an expert (or community) familiar with both the ontology and the biological context. Decisions about which terms to include are necessarily subjective, and the creation process itself and subsequent curation are time-consuming and largely manual. Results: Here we present an objective framework for generating customised ontology slims for specific annotated datasets, exploiting information latent in the structure of the ontology graph and in the annotation data. This framework combines ontology engineering approaches, and a data-driven algorithm that draws on graph and information theory. We illustrate this method by application to GO, generating GO slims at different information thresholds, characterising their depth of semantics and demonstrating the resulting gains in statistical power. Conclusions: Our GO slim creation pipeline is available for use in conjunction with any GO-annotated dataset, and creates dataset-specific, objectively defined slims. This method is fast and scalable for application to other biomedical ontologies.
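The idea of collapsing terms upward into a slim can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy ontology, the term names, the hand-picked slim set and the helper `map_to_slim` are all hypothetical, and the data-driven selection of slim terms at an information threshold is not reproduced here.

```python
from collections import deque

# Hypothetical toy ontology: each term maps to its direct parents (a DAG).
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "transport": ["root"],
    "lipid metabolism": ["metabolism"],
    "lipid transport": ["transport", "lipid metabolism"],
}


def map_to_slim(term, slim, parents):
    """Return the nearest slim terms reached by walking upward from `term`."""
    found = set()
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if current in slim:
            found.add(current)  # stop climbing along this path
            continue
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return found


# Assumed slim set, chosen by hand for illustration only.
slim = {"metabolism", "transport"}
annotations = {"geneA": ["lipid transport"], "geneB": ["lipid metabolism"]}

# Replace each gene's detailed terms with the slim terms they collapse into.
slim_annotations = {
    gene: sorted(set().union(*(map_to_slim(t, slim, PARENTS) for t in terms)))
    for gene, terms in annotations.items()
}
print(slim_annotations)
```

Because the ontology is a graph rather than a tree, a single detailed term can collapse into more than one slim term, which is why the helper returns a set.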
http://pprowler.itee.uq.edu.au/NucImport
BACKGROUND & AIMS: Colorectal cancer (CRC) incidence at ages younger than 50 years is increasing, leading to proposals to lower the CRC screening initiation age to 45 years. Data on the effectiveness of CRC screening at ages 45-49 years are lacking. METHODS: We studied the association between undergoing colonoscopy at ages 45-49 or 50-54 years and CRC incidence in a retrospective population-based cohort study using Florida's linked Healthcare Cost and Utilization Project databases with mandated reporting from 2005 to 2017 and Cox models extended for time-varying exposure. RESULTS: Among 195,600 persons with and 2.6 million without exposure to colonoscopy at ages 45-49 years, 276 and 4844 developed CRC, resulting in CRC incidence rates of 20.8 (95% CI, 18.5-23.4) and 30.6 (95% CI, 29.8-31.5) per 100,000 person-years, respectively. Among 660,248 persons with and 2.4 million without exposure to colonoscopy at ages 50-54 years, 798 and 6757 developed CRC, resulting in CRC incidence rates of 19.0 (95% CI, 17.7-20.4) and 51.9 (95% CI, 50.7-53.1) per 100,000 person-years, respectively. The adjusted hazard ratios for incident CRC after undergoing compared with not undergoing colonoscopy were 0.50 (95% CI, 0.44-0.56) at ages 45-49 years and 0.32 (95% CI, 0.29-0.34) at ages 50-54 years. The results were similar for women and men (hazard ratio, 0.48; 95% CI, 0.40-0.57 and hazard ratio, 0.52; 95% CI, 0.43-0.62 at ages 45-49 years, and hazard ratio, 0.35; 95% CI, 0.31-0.39 and hazard ratio, 0.29; 95% CI, 0.26-0.32 at ages 50-54 years, respectively). CONCLUSIONS: Colonoscopy at ages 45-49 or 50-54 years was associated with substantial decreases in subsequent CRC incidence. These findings can inform screening guidelines.
http://bioinf.scmb.uq.edu.au/dlocalmotif/
Gene expression data is widely used in various post-genomic analyses. The data is often probed using microarrays because they can measure the expression of thousands of genes simultaneously. The expression data, however, contains significant numbers of missing values, which can affect subsequent biological analysis. To minimize the impact of these missing values, several imputation algorithms have been proposed, including Collateral Missing Value Estimation (CMVE), Bayesian Principal Component Analysis (BPCA), Least Square Impute (LSImpute), Local Least Square Impute (LLSImpute), and K-Nearest Neighbour (KNN). These algorithms, however, exploit only the global or only the local correlation structure of the data, which can lead to higher estimation errors. This paper presents an Ameliorative Missing Value Imputation (AMVI) technique that can exploit global/local and positive/negative correlations in a given dataset by automatically selecting the optimal number of predictor genes k using a non-parametric wrapper method based on Monte Carlo simulations. The AMVI technique has the CMVE strategy at its core because CMVE, which exploits positive/negative correlations, has demonstrated improved performance compared to both low-variance methods such as BPCA and LLSImpute and high-variance methods such as KNN and ZeroImpute. The performance of AMVI is compared with CMVE, BPCA, LLSImpute, and KNN by randomly removing between 1% and 15% of the values in eight different ovarian, breast cancer and yeast datasets. Together with the standard NRMS error metric, the True Positive (TP) rate of significant gene selection, the biological significance of the selected genes and the statistical significance test results are presented to investigate the impact of missing values on subsequent biological analysis.
The enhanced performance of AMVI was demonstrated by its lower NRMS error, improved TP rate, biological significance of the selected genes and statistical significance test results, when compared with the aforementioned imputation methods across all the datasets. The results show that AMVI adapted to the latent correlation structure of the data and proved to be an effective and robust approach compared with the trial-and-error methodology for selecting k. The results confirmed that AMVI can be successfully applied to accurately impute missing values prior to any microarray data analysis.
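The evaluation protocol described above (hide a fraction of known values, impute them, score the result with NRMS error) can be sketched as follows. This is an illustration under stated assumptions, not the AMVI or CMVE implementation: the NRMS definition used here (root of squared error divided by root of squared true values) is one common form and may differ from the paper's, row-mean imputation is only a stand-in imputer, and the function names and toy matrix are hypothetical.

```python
import math
import random


def nrms_error(estimated, truth):
    """Assumed NRMS form: ||estimated - truth|| / ||truth|| over all cells."""
    num = sum((e - t) ** 2
              for est_row, true_row in zip(estimated, truth)
              for e, t in zip(est_row, true_row))
    den = sum(t ** 2 for true_row in truth for t in true_row)
    return math.sqrt(num / den)


def mask_random(matrix, fraction, rng):
    """Hide a random fraction of cells (at least one) by setting them to None."""
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = set(rng.sample(cells, max(1, round(fraction * len(cells)))))
    return [[None if (i, j) in hidden else v for j, v in enumerate(row)]
            for i, row in enumerate(matrix)]


def row_mean_impute(masked):
    """Stand-in imputer: fill each gap with the mean of its row's observed values."""
    filled = []
    for row in masked:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


# Hypothetical toy expression matrix (rows = genes, columns = samples).
truth = [
    [1.0, 1.2, 0.9, 1.1],
    [2.0, 2.1, 1.9, 2.2],
    [0.5, 0.4, 0.6, 0.5],
]
rng = random.Random(0)  # fixed seed so the masking is repeatable
masked = mask_random(truth, 0.15, rng)
imputed = row_mean_impute(masked)
error = nrms_error(imputed, truth)
print(f"NRMS error at 15% missing: {error:.4f}")
```

Repeating the masking at several missing rates and swapping in different imputers gives the kind of comparison the abstract reports.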
Microarray data often contains multiple missing gene expression values that degrade the performance of statistical and machine learning algorithms. This paper presents a K ranked diagonal covariance-based missing value estimation algorithm (KRCOV) that has demonstrated significantly superior performance compared to the more commonly used K-nearest neighbour (KNN) imputation algorithm when applied to estimate missing values in BRCA1, BRCA2 and Sporadic genetic mutation samples present in ovarian cancer. Experimental results confirm that KRCOV outperformed both KNN and zero imputation techniques in terms of classification accuracy when used to impute randomly missing values at rates from 1% to 5%.
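For orientation, the KNN baseline mentioned above can be sketched in a few lines. This is a generic KNN imputer, not KRCOV and not the exact variant used in the paper: the distance (root mean squared difference over columns observed in both rows), the default k, the zero fallback and the toy matrix are all assumptions made for illustration.

```python
import math


def knn_impute(matrix, k=2):
    """Fill each None with the mean of that column over the k most similar rows."""
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(matrix):
                if other_i == i or other[j] is None:
                    continue  # a neighbour must have the target column observed
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            # Assumed fallback: zero imputation when no usable neighbour exists.
            filled[i][j] = sum(nearest) / len(nearest) if nearest else 0.0
    return filled


# Hypothetical toy matrix (rows = genes, columns = samples); None marks a gap.
data = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],
]
completed = knn_impute(data, k=2)
print(completed[0])
```

In the toy matrix the two rows closest to the first row supply the estimate, while the distant fourth row is ignored.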