A growing number of students are completing undergraduate degrees in statistics and entering the workforce as data analysts. In these positions, they are expected to understand how to utilize databases and other data warehouses, scrape data from Internet sources, program solutions to complex problems in multiple languages, and think algorithmically as well as statistically. These data science topics have not traditionally been a major component of undergraduate programs in statistics. Consequently, a curricular shift is needed to address additional learning outcomes. The goal of this paper is to motivate the importance of data science proficiency and to provide examples and resources for instructors to implement data science in their own statistics curricula. We provide case studies from seven institutions. These varied approaches to teaching data science demonstrate curricular innovations to address new needs. Also included here are examples of assignments designed for courses that foster engagement of undergraduates with data and data science.
The Park City Math Institute 2016 Summer Undergraduate Faculty Program met for the purpose of composing guidelines for undergraduate programs in data science. The group consisted of 25 undergraduate faculty from a variety of institutions in the United States, primarily from the disciplines of mathematics, statistics, and computer science. These guidelines are meant to provide some structure for institutions planning for or revising a major in data science.
Data science is an emerging interdisciplinary field that combines elements of mathematics, statistics, computer science, and knowledge in a particular application domain for the purpose of extracting meaningful information from the increasingly sophisticated array of data available in many settings. These data tend to be non-traditional, in the sense that they are often live, large, complex, and/or messy. A first course in statistics at the undergraduate level typically introduces students to a variety of techniques to analyze small, neat, and clean data sets. However, whether they pursue more formal training in statistics or not, many of these students will end up working with data that are considerably more complex, and will need facility with statistical computing techniques. More importantly, these students require a framework for thinking structurally about data. We describe an undergraduate course in a liberal arts environment that provides students with the tools necessary to apply data science. The course emphasizes modern, practical, and useful skills that cover the full data analysis spectrum, from asking an interesting question to acquiring, managing, manipulating, processing, querying, analyzing, and visualizing data, as well as communicating findings in written, graphical, and oral forms.
Background: Proper collection and storage of fecal samples is necessary to guarantee the subsequent reliability of DNA-based soil-transmitted helminth diagnostic procedures. Previous research has examined various methods to preserve fecal samples for subsequent microscopic analysis or for subsequent determination of overall DNA yields obtained following DNA extraction. However, only limited research has focused on the preservation of soil-transmitted helminth DNA in stool samples stored at ambient temperature or maintained in a cold chain for extended periods of time. Methodology: Quantitative real-time PCR was used in this study as a measure of the effectiveness of seven commercially available products to preserve hookworm DNA over time and at different temperatures. Results were compared against "no preservative" controls and the "gold standard" of rapidly freezing samples at -20°C. The preservation methods were compared at both 4°C and at simulated tropical ambient temperature (32°C) over a period of 60 days. Evaluation of the effectiveness of each preservative was based on quantitative real-time PCR detection of target hookworm DNA. Conclusions: At 4°C there were no significant differences in DNA amplification efficiency (as measured by Cq values) regardless of the preservation method utilized over the 60-day period. At 32°C, preservation with FTA cards, potassium dichromate, and a silica bead two-step desiccation process proved most advantageous for minimizing Cq value increases, while RNAlater, 95% ethanol, and PAXgene also demonstrated some protective effect. These results suggest that fecal samples spiked with known concentrations of hookworm-derived egg material can remain at 4°C for 60 days in the absence of preservative, without significant degradation of the DNA target. Likewise, a variety of preservation methods can provide a measure of protection in the absence of a cold chain.
As a result, other factors, such as preservative toxicity, inhibitor resistance, preservative cost, shipping requirements, sample infectivity, and labor costs should be considered when deciding upon an appropriate method for the storage of fecal specimens for subsequent PCR analysis. Balancing logistical factors and the need to preserve the target DNA, we believe that under most circumstances 95% ethanol provides the most pragmatic choice for preserving stool samples in the field.
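The Cq comparisons above rest on the standard qPCR relationship between cycle shifts and template quantity: each additional cycle needed to reach threshold corresponds to roughly a halving of amplifiable target, so an increase of ΔCq implies a fold change of about 2^(-ΔCq). A minimal sketch of that arithmetic, assuming ~100% amplification efficiency (the function name is ours, not from the study):

```python
def fold_change(cq_sample: float, cq_control: float) -> float:
    """Approximate fold change in amplifiable target DNA implied by a
    Cq shift, assuming ~100% amplification efficiency (one doubling
    per cycle): fold = 2 ** -(Cq_sample - Cq_control)."""
    delta_cq = cq_sample - cq_control
    return 2 ** (-delta_cq)

# A preservative whose samples reach threshold 3 cycles later than the
# frozen controls retains roughly 2**-3 = 12.5% of the target.
print(fold_change(28.0, 25.0))  # 0.125
```

Real assays correct for efficiencies below 100%, which makes the true fold change smaller per cycle than this idealized estimate.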
Reproducibility is increasingly important to statistical research, but many details are often omitted from the published version of complex statistical analyses. A reader's comprehension is limited to what the author concludes, without exposure to the computational process. Often, the industrious reader cannot expand upon or validate the author's results. Even the author may struggle to reproduce their own results upon revisiting them. R Markdown is an authoring syntax that combines the ease of Markdown with the statistical programming language R. An R Markdown document or presentation interweaves computation, output and written analysis to the effect of transparency, clarity and an inherent invitation to reproduce (especially as sharing data is now as easy as the click of a button). It is an open-source tool that can be used either on its own or through the RStudio integrated development environment (IDE). In addition to facilitating reproducible research, R Markdown is a boon to collaboratively-minded data analysts, whose workflow can be streamlined by sharing only one master document that contains both code and content. Statistics educators may also find that R Markdown is helpful as a homework template, for both ease-of-use and in discouraging students from copy-and-pasting results from classmates. Training students in R Markdown will introduce to the workforce a new class of data analysts with an ingrained, foundational inclination toward reproducible research.
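The interweaving of computation and prose described above can be made concrete with a minimal R Markdown skeleton (a generic illustration, not an example from the article): a YAML header, narrative text, and an executable R chunk whose output is regenerated on every render.

````markdown
---
title: "A reproducible analysis"
output: html_document
---

The chunk below runs each time the document is rendered, so the
reported coefficients always match the code that produced them.

```{r}
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)
```
````

Rendering the file (for example with `rmarkdown::render()`) executes the chunk and embeds its output directly in the resulting HTML, leaving no opportunity for stale copy-and-pasted results.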
Many have argued that statistics students need additional facility to express statistical computations. By introducing students to commonplace tools for data management, visualization, and reproducible analysis in data science and applying these to real-world scenarios, we prepare them to think statistically. In an era of increasingly big data, it is imperative that students develop data-related capacities, beginning with the introductory course. We believe that the integration of these precursors to data science into our curricula, early and often, will help statisticians be part of the dialogue regarding Big Data and Big Questions. Specifically, through our shared experience working in industry, government, private consulting, and academia, we have identified five key elements which deserve greater emphasis in the undergraduate curriculum (in no particular order):
1. Thinking creatively, but constructively, about data. This "data tidying" includes the ability to move data not only between different file formats, but also between different shapes. There are elements of data storage design (e.g., normal forms) and foresight into how data should be arranged based on how it will likely be used.
2. Facility with data sets of varying sizes and some understanding of scalability issues when working with data. This includes an elementary understanding of basic computer architecture (e.g., memory vs. hard disk space) and the ability to query a relational database management system (RDBMS).
3. Statistical computing skills in a command-driven environment (e.g., R, Python, or Julia). Coding skills (in any language) are highly valued and increasingly necessary. They provide freedom from the un-reproducible point-and-click application paradigm.
4. Experience wrestling with large, messy, complex, challenging data sets for which there is no obvious goal or specially curated statistical method (see SIDEBAR: What's in a name).
While perhaps suboptimal for teaching specific statistical methods, these data are more similar to what analysts actually see in the wild.
5. An ethos of reproducibility. This is a major challenge for science in general, and we have the comparatively easy task of simply reproducing computations and analysis.
We illustrate how these five elements can be addressed in the undergraduate curriculum. To this end, we explore questions related to airline travel using a large data set (point 4 above) that is by necessity housed in a relational database (2; see SIDEBAR: Databases). We present R code (3) using the dplyr framework (1), and moreover, this paper itself, in the reproducible R Markdown format (5). Statistical educators play a key role in helping to prepare the next generation of statisticians and data scientists. We hope that this exercise will assist them in narrowing the aforementioned skills gap.
A framework for data-related skills: The statistical data analysis cycle involves the formulation of questions, collection of data, analysis, and interpretation of results (see Figure 1). Data preparation and manipulation is not j...
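The database-plus-scripting workflow these elements describe can be sketched in a few lines. The example below is a toy stand-in, not the paper's actual airline analysis (which uses R with dplyr): it pushes an aggregation into an RDBMS with SQL (element 2) from a command-driven environment (element 3), returning only the small summary to the analysis session.

```python
import sqlite3
import pandas as pd

# Toy stand-in for a large flights table; in practice the table lives
# in an RDBMS and is far too large to load whole into memory.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flights (carrier TEXT, dep_delay REAL)")
con.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("AA", 5.0), ("AA", 15.0), ("UA", 0.0), ("UA", 30.0)],
)

# Push the aggregation into the database: only the per-carrier summary,
# not the raw rows, crosses into the analysis environment.
sql = """
SELECT carrier, AVG(dep_delay) AS mean_delay
FROM flights
GROUP BY carrier
ORDER BY carrier
"""
summary = pd.read_sql_query(sql, con)
print(summary)
```

The same division of labor holds in R, where dplyr translates data-manipulation verbs into SQL executed by the database backend.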
Statistical applications in sports have long centered on how to best separate signal (e.g., team talent) from random noise. However, most of this work has concentrated on a single sport, and the development of meaningful cross-sport comparisons has been impeded by the difficulty of translating luck from one sport to another. In this manuscript, we develop Bayesian state-space models using betting market data that can be uniformly applied across sporting organizations to better understand the role of randomness in game outcomes. These models can be used to extract estimates of team strength, the between-season, within-season, and game-to-game variability of team strengths, as well as each team's home advantage. We implement our approach across a decade of play in each of the National Football League (NFL), National Hockey League (NHL), National Basketball Association (NBA), and Major League Baseball (MLB), finding that the NBA demonstrates both the largest dispersion in talent and the largest home advantage, while the NHL and MLB stand out for their relative randomness in game outcomes. We conclude by proposing new metrics for judging competitiveness across sports leagues, both within the regular season and using traditional postseason tournament formats. Although we focus on sports, we discuss a number of other situations in which our generalizable models might be usefully applied.
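The state-space structure described above can be illustrated with a simulation sketch. The parameter values and function below are our own illustrative choices, not the paper's fitted Bayesian model: latent team strengths evolve week to week as an autoregressive process (game-to-game variability), and the observed betting-market point spread is the strength gap plus a home advantage and observation noise.

```python
import random

random.seed(1)

def simulate_season(weeks=20, rho=0.95, sigma_evol=0.3,
                    home_adv=2.5, sigma_obs=1.0):
    """Simulate one matchup's point spreads under a toy state-space
    model: strengths follow an AR(1) random walk; the observed spread
    is (home strength - away strength) + home advantage + noise."""
    theta_home, theta_away = 1.0, -1.0  # latent team strengths
    spreads = []
    for _ in range(weeks):
        # Game-to-game evolution of each latent strength.
        theta_home = rho * theta_home + random.gauss(0, sigma_evol)
        theta_away = rho * theta_away + random.gauss(0, sigma_evol)
        # Observation equation: the market's point spread.
        spread = (theta_home - theta_away) + home_adv \
            + random.gauss(0, sigma_obs)
        spreads.append(spread)
    return spreads

spreads = simulate_season()
print(sum(spreads) / len(spreads))
```

Fitting such a model in reverse, i.e., recovering the latent strengths and the evolution variances from observed spreads, is what the paper's Bayesian machinery does, and the estimated variances are what allow luck to be compared across leagues on a common scale.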