6Summary: Precision medicine uses patient clinical and molecular characteristics to personalize diagnosis 7 and treatment. This emerging discipline integrates multi-modal data into large-scale studies of human 8 disease to make accurate individual-level predictions. The success of these studies will depend on the 9 generalizability of the results, the ability of other researchers and clinicians to replicate studies, and the 10 understandability of the methods used. Tools for data management and standardization are needed to 11 promote flexible, transparent, and reproducible analyses. Here we present cohorts, a python package 12 facilitating clinical and biomarker data management to enhance standardization and reproducibility of 13 clinical findings. 14 Availability: The python package cohorts is available at http://www.github.com/ngiangre/cohorts. 15 CONTACT: NPT2105@CUMC.COLUMBIA.EDU 16 17 19 requires effective data management for reproducibility of clinical research findings (Vargas and Harris, 2016; Leopold20 and Loscalzo, 2018; Niven et al., 2018). The studies often span institutions, departments, and research teams, where 21 patient data is collected and processed in a specialized way. Clinical studies including patient data from multiple 22 sources are integrated and managed by the primary research team, and so correct attribution of the clinical and 23 biomarker characteristics is critical. Feasible and accessible software can facilitate data storage and management, 24 which promotes flexible, transparent, and reproducible analyses. Moreover, software that facilitates integration of 25 multiple patient cohorts can promote the use of advanced statistical and machine learning analyses.
26Precision medicine calls for rigor and reproducibility in data generation and data analysis (Mueller et al., 2018; 27 Orton and Doucette, 2013). For large consortium based studies, computational platforms allow for sharing data and 28 merging of multisite datasets (Lam et al., 2016). While large-scale studies may have an extensive computing platform 29 with custom software (Wolstencroft et al., 2015; Price et al., 2019), most clinical studies are smaller and would benefit 30 from standardized procedures and interoperable data management and analysis tools. There are many software 31 packages and platforms that perform precision medicine data analysis, such as pyGeno (Daouda et al., 2016) and in 32 particular packages from the package manager Bioconductor in the R programming language (Huber et al., 2015).
33However, many of these applications can handle only one experimental type or are particular to a specific end-to-end 34 data analysis. There is a need for flexible and open source software, where patient clinical and biomarker data can 35 originate from different cohorts, platforms, and data types.
36Here, we present the python package cohorts, which provides storage and management of patient clinical and 37 biomarker data. Moreover, cohorts facilitates integration of multiple patient cohort data through common attrib...