A systematic and reproducible “workflow”—the process that moves a scientific investigation from raw data to coherent research question to insightful contribution—should be a fundamental part of academic data-intensive research practice. In this paper, we elaborate basic principles of a reproducible data analysis workflow by defining 3 phases: the Explore, Refine, and Produce phases. Each phase is roughly centered on the audience to whom research decisions, methodologies, and results are being immediately communicated. Importantly, each phase can also give rise to a number of research products beyond traditional academic publications. Where relevant, we draw analogies between design principles and established practice in software development. The guidance provided here is not intended to be a strict rulebook; rather, the suggested practices and tools for advancing reproducible, sound data-intensive analysis may support both students new to research and current researchers who are new to data-intensive work.
Functional, usable, and maintainable open-source software is increasingly essential to scientific research, but formal training in software development and maintainability varies widely. Here, we propose 10 “rules” centered on 2 best-practice components: clean code and testing. These 2 areas are relatively straightforward and provide substantial utility relative to the learning investment. Adopting clean code practices helps to standardize and organize code, enhancing readability and reducing cognitive load for both the initial developer and subsequent contributors; this allows developers to concentrate on core functionality and make fewer errors. Clean coding styles also make code more amenable to testing, particularly unit testing, which works best with modular, consistent code. Unit tests interrogate specific, isolated behavior to reduce coding errors and ensure intended functionality, especially as code grows in complexity; unit tests also implicitly provide usage examples. Other forms of testing are geared toward discovering erroneous behavior arising from unexpected inputs or emerging from the interactions within complex codebases. Although conforming to coding styles and designing tests can add time to a software project in the short term, these foundational tools help to improve the correctness, quality, usability, and maintainability of open-source scientific software. They also advance the principal aim of scientific research: producing accurate results in a reproducible way. In addition to suggesting several tips for getting started with clean code and testing practices, we recommend numerous tools for the popular open-source scientific languages Python, R, and Julia.
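As a minimal illustration of the unit testing practice described in this abstract (the function, its behavior, and the test names below are our own hypothetical examples, not drawn from the paper or its recommended tools), a small, pure function can be tested in isolation, and the tests double as usage examples:

```python
def normalize(values):
    """Scale a sequence of non-negative numbers so they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize an all-zero sequence")
    return [v / total for v in values]


# Unit tests interrogate one specific, isolated behavior each;
# runnable with pytest or as a plain script.
def test_normalize_sums_to_one():
    result = normalize([2, 3, 5])
    assert abs(sum(result) - 1.0) < 1e-12
    assert result == [0.2, 0.3, 0.5]


def test_normalize_rejects_all_zero_input():
    try:
        normalize([0, 0])
    except ValueError:
        pass  # expected: the error case is part of the contract
    else:
        raise AssertionError("expected ValueError for all-zero input")


if __name__ == "__main__":
    test_normalize_sums_to_one()
    test_normalize_rejects_all_zero_input()
    print("all tests passed")
```

Note how the second test pins down the error-handling contract explicitly; without it, a later refactor could silently change the all-zero case from an exception to, say, returning zeros.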
Integrated Assessment Models (IAMs) have become critical tools for assessing the costs and benefits of policies to reduce greenhouse gas emissions. Three models currently inform the social cost of carbon dioxide (SCCO2, the net present value of damages from one additional ton of CO2) used by the US federal government, several states, and Canada. Here we present a new open-source implementation of one of these models (PAGE09) in the Julia programming language using a modular modeling framework (Mimi). Mimi-PAGE was developed using best coding practices (including multiple code reviews by different individuals during development, automated testing of newly committed code, and provision of documentation and usage notes) and is publicly available in a GitHub repository for community inspection and use under an open-source license. In this paper we describe the Julia implementation of PAGE09, show that output from Mimi-PAGE matches that of the original model, and compare run times between the two implementations.
This study assesses the incorporation of health impacts in economic models of climate change. Improving the health functions in integrated assessment models will lead to a more accurate estimation of the social cost of carbon. Socioeconomic factors modify the interaction between climate and health and should be considered in future updates of integrated assessment models.
Mosquito-borne diseases such as malaria continue to pose a major global health burden, and the impact of currently available interventions is stagnating. Consequently, there is interest in novel tools to control these diseases, including gene drive-modified mosquitoes. As these tools continue to be refined, decisions on whether to implement them in the field depend on their alignment with target product profiles (TPPs) that define the product characteristics required to achieve desired entomological and epidemiological outcomes. TPPs are increasingly being used for malaria and vector control interventions, such as attractive targeted sugar baits and long-acting injectable drugs, as these interventions progress through the development pipeline. For mosquito gene drive products, reliable predictions from mathematical models are an essential part of these analyses, as field releases could potentially be irreversible. Here, we review the prior use of mathematical models in developing TPPs for malaria and vector control tools and discuss lessons from these analyses that may apply to mosquito gene drives. We recommend that, as gene drive technology gets closer to field release, discussions regarding target outcomes engage a wide range of stakeholders and account for the settings of interest and the vector species present. Given the relatively large number of parameters that describe gene drive products, machine learning approaches may be useful to explore parameter space, and an emphasis on conservative fitness estimates is advisable, given the difficulty of accurately measuring these parameters prior to field studies. Modeling may also help to inform the risk, remediation, and cost dimensions of mosquito gene drive TPPs.
As anthropogenic factors contribute to the introduction and expansion of new and established vector species, the geographic incidence of mosquito-borne disease is shifting. Computer simulations, informed by field data where possible, facilitate the cost-effective evaluation of available public health interventions and are a powerful tool for informing appropriate policy action. However, a variety of measurements are used in such assessments; this can complicate direct comparisons across both vector control technologies and the models used to simulate them. The expansion of biocontrol to include genetically engineered organisms is now prompting additional metrics with no analogy to traditional measurement approaches. We propose Standard Entomological Metrics (SEMs) to facilitate the model-based appraisal of both existing and novel intervention tools, and define two examples: Suppression Efficacy Score and Time to Reduction Target. We formulate twelve synthetic case studies featuring two vector control technologies over three years of observed daily temperature in Cairns, Australia. After calculating Suppression Efficacy Score and Time to Reduction Target results, we apply these example outcomes to a discussion of health policy decision-making using SEMs. We submit that SEMs such as Suppression Efficacy Score and Time to Reduction Target facilitate the holistic and environmentally appropriate simulation-based evaluation of intervention programs, and we invite the community to further discussion on this topic.
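The formal SEM definitions are given in the paper itself; purely as an illustrative sketch of the Time to Reduction Target idea (the function name, signature, threshold convention, and synthetic population series below are our own assumptions, not the paper's definitions), such a metric can be read off a simulated daily population trajectory:

```python
def time_to_reduction_target(daily_population, baseline, target_fraction):
    """Return the first day (0-indexed) on which the simulated vector
    population falls to or below target_fraction * baseline, or None
    if the target is never reached within the simulation window.

    Illustrative only: the paper's formal SEM definition may differ.
    """
    threshold = target_fraction * baseline
    for day, population in enumerate(daily_population):
        if population <= threshold:
            return day
    return None


# Synthetic example: a population declining after an intervention.
series = [1000, 900, 700, 450, 260, 120, 60, 20]
day = time_to_reduction_target(series, baseline=1000, target_fraction=0.1)
print(day)  # prints 6: the first day at or below 10% of baseline
```

A metric of this shape is directly comparable across intervention technologies and simulation models, which is the stated motivation for SEMs: two very different control tools can both be summarized as "days until the population reached X% of baseline."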