Highlights & graphical abstract
• Data science has emerged as a fourth paradigm of science, alongside the theoretical, experimental, and computational.
• Structural biology's rich history includes practices, such as an emphasis on openness and reproducibility, which can serve as positive models for many nascent areas of data science.
• Machine learning is profoundly impacting the biosciences, based on recent literature trends; we are likely at the cusp of a gold rush moment in structural biology.
Abstract
Data science has emerged from the proliferation of digital data, coupled with advances in algorithms, software and hardware (e.g., GPU computing). Innovations in structural biology have been driven by similar factors, spurring us to ask: can these two fields impact one another in deep and hitherto unforeseen ways? We posit that the answer is yes. New biological knowledge lies in the relationships between sequence, structure, function and disease, all of which play out on the stage of evolution, and data science enables us to elucidate these relationships at scale. Here, we consider the above question from the five key pillars of data science: acquisition, engineering, analytics, visualization and policy, with an emphasis on machine learning as the premier analytics approach.
Introduction
The term Structural Biology (SB) can be defined rather precisely as a scientific field, but Data Science (DS) is more enigmatic, at least currently. The intrinsic difference is two-fold. First, DS is a young field, so its precise meaning, based on what we practice and how we educate its practitioners, has had less time than SB [1,2] to coalesce into a consensus definition. Second, and more fundamental, DS is interdisciplinary to an extreme; indeed, DS is not so much a field in itself as it is a way of doing science, given large amounts of diverse and complex data, suitable algorithms and sufficient computing resources.

Such is the breadth and depth of DS that it has been described as a fourth paradigm of science, alongside the theoretical, experimental and computational [3,4]. Because DS is so vast and sprawling, a helpful organizational scheme is to consider four V's and five P's that characterize data and DS (Figure 1). The four V's describe the properties of data: volume, velocity, variety and veracity. The P's are the five disciplinary pillars (P-i through P-v) of DS (Figure 1): (i) data acquisition, (ii) data reduction, integration and engineering, (iii) data analysis (often via machine learning), (iv) data visualization, provenance and dissemination, and (v) ethical, legal, social and policy-related matters. The P's are interrelated, as are the V's. For example, the fifth pillar leans into each of the other four: a host of privacy matters surround data acquisition, aggregation can have unforeseen security concerns, analytics algorithms can introduce unintended bias, and dissemination policies raise licensing and intellectual-property issues. Similarly, many modes of data analysis (P-iii) rely on advanced visua...