Background: There is growing interest in using machine learning approaches to priority rank studies and reduce the human burden of screening literature when conducting systematic reviews. In addition, identifying addressable questions during the problem formulation phase of systematic review can be challenging, especially for topics with a large literature base. Here, we assess the performance of the SWIFT-Review priority ranking algorithm for identifying studies relevant to a given research question. We also explore the use of SWIFT-Review during problem formulation to identify, categorize, and visualize research areas that are data rich/data poor within a large literature corpus. Methods: Twenty case studies, including 15 public datasets, representing a range of complexity and size, were used to assess the priority ranking performance of SWIFT-Review. For each study, seed sets of manually annotated included and excluded titles and abstracts were used for machine training. The remaining references were then ranked for relevance using an algorithm that considers term frequency and latent Dirichlet allocation (LDA) topic modeling. This ranking was evaluated with respect to (1) the number of studies that must be screened in order to identify 95% of known relevant studies and (2) the "Work Saved over Sampling" (WSS) performance metric. To assess SWIFT-Review for use in problem formulation, PubMed literature search results for 171 chemicals implicated as endocrine-disrupting chemicals (EDCs) were uploaded into SWIFT-Review (264,588 studies) and categorized based on evidence stream and health outcome. Patterns in the search results were surveyed and visualized using a variety of interactive graphics. Results: Compared with the reported performance of other tools on the same datasets, the SWIFT-Review ranking procedure obtained the highest scores on 11 of the 15 public datasets. Overall, these results suggest that using machine learning to triage documents for screening has the potential to save, on average, more than 50% of the screening effort ordinarily required when using unordered document lists. In addition, the tagging and annotation capabilities of SWIFT-Review can be useful during scoping and problem formulation. Conclusions: Text-mining and machine learning software such as SWIFT-Review can be valuable tools to reduce the human screening burden and assist in problem formulation. The online version of this article (doi:10.1186/s13643-016-0263-z) contains supplementary material.
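The WSS metric used above has a simple closed form: at recall level r, WSS is the fraction of the reference list that did not need to be screened, minus the allowance (1 - r). Below is a minimal sketch of computing WSS@95% from a ranked list; the function name and inputs are illustrative and not part of SWIFT-Review itself.

```python
# Sketch: computing WSS@95% for a ranked reference list.
# `ranked_labels` is a list of 0/1 relevance labels in ranked order;
# these names are illustrative, not part of SWIFT-Review's API.

def wss_at_recall(ranked_labels, recall=0.95):
    """Work Saved over Sampling at a given recall level."""
    n_total = len(ranked_labels)
    n_relevant = sum(ranked_labels)
    target = recall * n_relevant  # relevant studies we must find
    found = 0
    for screened, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= target:
            # Fraction of the list we did NOT have to screen,
            # minus the (1 - recall) we would miss anyway.
            return (n_total - screened) / n_total - (1.0 - recall)
    return 0.0

# Example: 2 relevant studies ranked near the top of 10 references.
print(wss_at_recall([1, 0, 1, 0, 0, 0, 0, 0, 0, 0]))  # 0.65
```

In the toy example, both relevant studies are found within the top three ranks, so 70% of the screening is saved; subtracting the 5% allowance gives WSS@95% = 0.65.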
Changes in gene expression can help reveal the mechanisms of disease processes and the mode of action for toxicities and adverse effects on cellular responses induced by exposures to chemicals, drugs, and environmental agents. The U.S. Tox21 Federal collaboration, which currently quantifies the biological effects of nearly 10,000 chemicals via quantitative high-throughput screening (qHTS) in in vitro model systems, is now making an effort to incorporate gene expression profiling into the existing battery of assays. Whole-transcriptome analyses performed on large numbers of samples using microarrays or RNA-Seq are currently cost-prohibitive. Accordingly, the Tox21 Program is pursuing a high-throughput transcriptomics (HTT) method that focuses on the targeted detection of gene expression for a carefully selected subset of the transcriptome, potentially reducing costs roughly 10-fold and allowing for the analysis of larger numbers of samples. To identify the optimal transcriptome subset, genes were sought that are (1) representative of the highly diverse biological space, (2) capable of serving as a proxy for expression changes in unmeasured genes, and (3) sufficient to provide coverage of well-described biological pathways. A hybrid method for gene selection is presented herein that combines data-driven and knowledge-driven concepts into one cohesive method. Our approach is modular, applicable to any species, and facilitates a robust, quantitative evaluation of performance. In particular, we were able to perform gene selection such that the resulting set of "sentinel genes" adequately represents all known canonical pathways from the Molecular Signatures Database (MSigDB v4.0) and can be used to infer expression changes for the remainder of the transcriptome. The resulting computational model allowed us to choose a purely data-driven subset of 1500 sentinel genes, referred to as the S1500 set, which was then augmented with a knowledge-driven selection of additional genes to create the final S1500+ gene set. Our results indicate that the selected sentinel genes can be used to accurately predict pathway perturbations and biological relationships for samples under study.
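The abstract states that the sentinel set can be used to infer expression changes for unmeasured genes, but does not specify the model. The following is only a minimal sketch of the general idea, using multi-output ridge regression on synthetic data; all names, dimensions, and the choice of regression method are hypothetical, not the Tox21 implementation.

```python
# Sketch: inferring non-sentinel gene expression from sentinel genes.
# Ordinary ridge regression is used here purely for illustration;
# all variable names and data are hypothetical.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_samples, n_sentinel, n_other = 200, 1500, 5000

# Toy whole-transcriptome training data (samples x genes).
X_sentinel = rng.normal(size=(n_samples, n_sentinel))
W_true = rng.normal(scale=0.1, size=(n_sentinel, n_other))
Y_other = X_sentinel @ W_true + rng.normal(scale=0.05, size=(n_samples, n_other))

# Fit one multi-output model mapping sentinel expression to the rest.
model = Ridge(alpha=1.0).fit(X_sentinel, Y_other)

# Given a new sample measured only on the sentinel set,
# predict expression for the unmeasured genes.
new_sample = rng.normal(size=(1, n_sentinel))
inferred = model.predict(new_sample)  # shape (1, n_other)
print(inferred.shape)
```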
We report the results of a genome-wide analysis of transcription in Arabidopsis thaliana after treatment with Pseudomonas syringae pathovar tomato. Our time course RNA-Seq experiment uses over 500 million read pairs to provide a detailed characterization of the response to infection in both susceptible and resistant hosts. The set of observed differentially expressed genes is consistent with previous studies, confirming and extending existing findings about genes likely to play an important role in the defense response to Pseudomonas syringae. The high coverage of the Arabidopsis transcriptome resulted in the discovery of a surprisingly large number of alternative splicing (AS) events: more than 44% of multi-exon genes showed evidence for novel AS in at least one of the probed conditions. This demonstrates that the Arabidopsis transcriptome annotation is still highly incomplete and that AS events are more abundant than expected. To further refine our predictions, we identified genes with statistically significant changes in the ratios of alternative isoforms between treatments. This set includes several genes previously known to be alternatively spliced or expressed during the defense response, and it may serve as a pool of candidate genes for regulated alternative splicing with possible biological relevance to the defense response against invasive pathogens.
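One common way to flag "statistically significant changes in the ratios of alternative isoforms between treatments" is a contingency-table test on per-isoform read counts. The sketch below uses a chi-square test with hypothetical counts; the paper's actual statistical procedure is not specified in this abstract.

```python
# Sketch: testing for a shift in isoform usage between two conditions.
# Read counts are hypothetical; a simple chi-square test of
# independence on isoform counts is shown purely for illustration.
from scipy.stats import chi2_contingency

# Rows: isoforms of one multi-exon gene; columns: control vs. infected.
counts = [
    [120,  45],   # isoform A
    [ 30, 110],   # isoform B
]
chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi2={chi2:.1f}, p={p_value:.2e}")  # small p suggests differential splicing
```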
The number and wide dynamic range of components found in biological matrices present several challenges for global proteomics. In this perspective, we examine the potential of zero-dimensional (0D), one-dimensional (1D), and two-dimensional (2D) separations coupled with Fourier-transform ion cyclotron resonance (FT-ICR) and time-of-flight (TOF) mass spectrometry (MS) for the analysis of complex mixtures. We describe and further develop previous reports on the m/z space occupied by peptides to calculate the theoretical peak capacity available to each separations-mass spectrometry method examined. Briefly, the peak capacity attainable by each of the mass analyzers was determined from the mass resolving power (RP) and the m/z space occupied by peptides, as derived from the mass distribution of tryptic peptides in the National Center for Biotechnology Information's (NCBI's) nonredundant database. Our results indicate that reversed-phase nanoHPLC (RP-nHPLC) separation coupled with FT-ICR MS offers an order-of-magnitude improvement in peak capacity over RP-nHPLC separation coupled with TOF MS. The addition of an orthogonal separation method, strong cation exchange (SCX), for 2D LC-MS provides a further 10-fold improvement in peak capacity over 1D LC-MS methods. Peak capacity calculations for 0D LC, two different 1D RP-HPLC methods, and 2D LC (with various numbers of SCX fractions), with both RP-HPLC methods coupled to FT-ICR and TOF MS, are examined in detail. Peak capacity production rates, which take into account the total analysis time, are also considered for each of the methods. Furthermore, the significance of the space occupied by peptides is discussed.
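As a rough illustration of how such peak capacities combine: for a mass analyzer with approximately constant resolving power RP over an occupied m/z window, the MS-dimension peak capacity is approximately RP x ln(m_high/m_low), and the capacity of a multidimensional method is commonly approximated as the product of its orthogonal dimensions. The numbers below are illustrative assumptions, not the paper's values.

```python
# Sketch: back-of-the-envelope peak capacity for an LC-MS experiment.
# For a mass analyzer with constant resolving power RP over an occupied
# m/z window [m_low, m_high], peak capacity is approximately
# RP * ln(m_high / m_low); total multidimensional capacity is treated
# as the product of orthogonal dimensions. All numbers are illustrative.
import math

def ms_peak_capacity(rp, m_low, m_high):
    return rp * math.log(m_high / m_low)

n_ms_fticr = ms_peak_capacity(rp=100_000, m_low=400, m_high=2000)
n_ms_tof   = ms_peak_capacity(rp=10_000,  m_low=400, m_high=2000)
n_rplc     = 200          # hypothetical RP-nHPLC peak capacity
n_scx      = 10           # hypothetical number of SCX fractions

print(f"1D LC/FT-ICR: {n_rplc * n_ms_fticr:,.0f}")
print(f"1D LC/TOF:    {n_rplc * n_ms_tof:,.0f}")
print(f"2D LC/FT-ICR: {n_scx * n_rplc * n_ms_fticr:,.0f}")
```

With these assumed numbers, the 10x resolving-power advantage of FT-ICR over TOF and the 10x multiplier from SCX fractionation carry through directly to the product, mirroring the order-of-magnitude improvements described above.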
Background: In the screening phase of systematic review, researchers use detailed inclusion/exclusion criteria to decide whether each article in a set of candidate articles is relevant to the research question under consideration. A typical review may require screening thousands or tens of thousands of articles and can consume hundreds of person-hours of labor. Methods: Here we introduce SWIFT-Active Screener, a web-based, collaborative systematic review software application designed to reduce the overall screening burden required during this resource-intensive phase of the review process. To prioritize articles for review, SWIFT-Active Screener uses active learning, a type of machine learning that incorporates user feedback during screening. Meanwhile, a negative binomial model is employed to estimate the number of relevant articles remaining in the unscreened document list. Using a simulation involving 26 diverse systematic review datasets that were previously screened by reviewers, we evaluated both the document prioritization and recall estimation methods. Results: On average, 95% of the relevant articles were identified after screening only 40% of the total reference list. In the five document sets with 5,000 or more references, 95% recall was achieved after screening only 34% of the available references, on average. Furthermore, the recall estimator we have proposed provides a useful, conservative estimate of the percentage of relevant documents identified during screening. Conclusion: SWIFT-Active Screener can yield significant time savings compared with traditional screening, and the savings increase with project size. Moreover, the integration of explicit recall estimation during screening addresses an important challenge faced by all machine learning systems for document screening: deciding when to stop screening a prioritized reference list. The software is currently available as a multi-user, collaborative, online web application.
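As a sketch of the active learning loop described above: after each batch of human screening decisions, a classifier is retrained on all labels collected so far and the unscreened references are re-ranked. The classifier, features, and batch size below are illustrative assumptions, not SWIFT-Active Screener's actual model, and the negative binomial recall estimator is omitted.

```python
# Sketch: an active-learning screening loop of the kind described above.
# TF-IDF features and logistic regression are illustrative choices only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def next_batch(docs, labels, batch_size=25):
    """Rank unscreened documents by predicted relevance.

    docs: list of title+abstract strings.
    labels: dict mapping doc index -> 0/1 human screening decision;
            the seed set must contain both included and excluded examples.
    """
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    screened = sorted(labels)
    unscreened = [i for i in range(len(docs)) if i not in labels]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[screened], [labels[i] for i in screened])
    scores = clf.predict_proba(X[unscreened])[:, 1]
    order = np.argsort(-scores)
    return [unscreened[i] for i in order[:batch_size]]

# Toy example: two seed labels, then the model suggests what to read next.
docs = [
    "mice exposed to chemical X showed liver toxicity",
    "a review of unrelated economic policy",
    "hepatic effects of chemical X in rats",
    "survey of consumer preferences",
    "chemical X alters liver enzyme levels in vitro",
    "history of the printing press",
]
labels = {0: 1, 1: 0}  # human-screened seed set (one include, one exclude)
print(next_batch(docs, labels, batch_size=2))
```

In practice the loop repeats: the reviewer labels each suggested batch, the model re-ranks the remainder, and screening stops once the estimated recall target (e.g., 95%) is reached.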
Eukaryotic marine microalgae like Dunaliella spp. have great potential as a feedstock for liquid transportation fuels because they grow fast and can accumulate high levels of triacylglycerides with little need for fresh water or land. Their growth rates vary between species and depend on environmental conditions. The cell cycle and the accumulation of starch and triacylglycerol are controlled by the diurnal light:dark cycle. Storage compounds such as starch and triacylglycerol accumulate in the light, when CO2 fixation rates exceed the need of assimilated carbon and energy for cell maintenance and division during the dark phase. To delineate environmental effects, we analyzed cell division rates, metabolism, and transcriptional regulation in Dunaliella viridis in response to changes in light duration and growth temperature. Its rate of cell division increased under continuous light, while a shift in temperature from 25°C to 35°C did not significantly affect the cell division rate but increased the triacylglycerol content per cell several-fold under continuous light. The amount of saturated fatty acids in the triacylglycerol fraction was more responsive to an increase in temperature than to a change in the light regime. Detailed fatty acid profiles showed that Dunaliella viridis incorporated lauric acid (C12:0) into triacylglycerol after 24 hours under continuous light. Transcriptome analysis identified potential regulators involved in the light- and temperature-induced lipid accumulation in Dunaliella viridis.
In dynamic topic modeling, the proportional contribution of a topic to a document depends on the temporal dynamics of that topic's overall prevalence in the corpus. We extend the Dynamic Topic Model of Blei and Lafferty (2006) by explicitly modeling document-level topic proportions with covariates and dynamic structure that includes polynomial trends and periodicity. A Markov Chain Monte Carlo (MCMC) algorithm that utilizes Polya-Gamma data augmentation is developed for posterior inference. Conditional independencies in the model and sampling are made explicit, and our MCMC algorithm is parallelized where possible to allow for inference in large corpora. To address computational bottlenecks associated with Polya-Gamma sampling, we appeal to the Central Limit Theorem to develop a Gaussian approximation to the Polya-Gamma random variable. This approximation is fast and reliable for parameter values relevant in the text-mining domain. Our model and inference algorithm are validated with multiple simulation examples, and we consider the application of modeling trends in PubMed abstracts. We demonstrate that sharing information across documents is critical for accurately estimating document-specific topic proportions. We also show that explicitly modeling polynomial and periodic behavior improves our ability to predict topic prevalence at future time points.
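The Gaussian approximation mentioned above rests on the fact that PG(b, c) is a sum of b independent PG(1, c) variables with known moments: E[omega] = b/(2c) tanh(c/2) and Var[omega] = b (sinh(c) - c) / (4 c^3 cosh^2(c/2)). A moment-matched normal draw, sketched below, therefore becomes accurate as b grows; the paper's exact cutoff for switching to the approximation and its implementation details may differ.

```python
# Sketch: moment-matched Gaussian approximation to a Polya-Gamma draw.
# PG(b, c) has closed-form mean and variance, and by the CLT it is
# approximately normal for large b (it is a sum of b i.i.d. PG(1, c)
# variables). The paper's exact switching rule may differ.
import numpy as np

def pg_moments(b, c):
    """Mean and variance of omega ~ PG(b, c), for c != 0."""
    mean = b / (2.0 * c) * np.tanh(c / 2.0)
    var = b / (4.0 * c**3) * (np.sinh(c) - c) / np.cosh(c / 2.0) ** 2
    return mean, var

def approx_pg_sample(b, c, rng):
    """Approximate PG(b, c) draw via a moment-matched Gaussian."""
    mean, var = pg_moments(b, c)
    return rng.normal(mean, np.sqrt(var))

rng = np.random.default_rng(1)
print(approx_pg_sample(b=500.0, c=1.3, rng=rng))
```

As a sanity check, in the limit c -> 0 these formulas recover the known values E[PG(b, 0)] = b/4 and Var[PG(b, 0)] = b/24.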