This paper describes the e-Science Central (e-SC) cloud data processing system and its application to a number of e-Science projects. e-SC provides both software as a service (SaaS) and a platform as a service (PaaS) for scientific data management, analysis and collaboration. It is a portable system and can be deployed on both private (e.g. Eucalyptus) and public clouds (Amazon AWS and Microsoft Windows Azure). The SaaS application allows scientists to upload data, edit and run workflows and share results in the cloud, using only a Web browser. It is underpinned by a scalable cloud platform consisting of a set of components designed to support the needs of scientists. The platform is exposed to developers so that they can easily upload their own analysis services into the system and make these available to other users. A representational state transfer (REST)-based application programming interface (API) is also provided so that external applications can leverage the platform's functionality, making it easier to build scalable, secure cloud-based applications. This paper describes the design of e-SC, its API and its use in three different case studies: spectral data visualization, medical data capture and analysis, and chemical property prediction.
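As an illustration of how an external application might address a REST-based workflow API of this kind, the sketch below builds the requests a minimal client would issue. All endpoint paths, parameter names and identifiers here are illustrative assumptions, not the actual e-SC API.

```python
# Hypothetical sketch of a client for a REST-style workflow service.
# The endpoint layout (folders/files, workflows/invocations) is an
# assumption for illustration only, not the documented e-SC API.
class WorkflowClient:
    def __init__(self, base_url, token):
        self.base_url = base_url.rstrip("/")
        self.token = token  # bearer token or API key, however the service authenticates

    def _url(self, *parts):
        return "/".join([self.base_url, *parts])

    def upload_data_request(self, folder_id, filename):
        # Build the (method, URL) pair for uploading a file into a folder
        return ("POST", self._url("folders", folder_id, "files", filename))

    def run_workflow_request(self, workflow_id):
        # Build the (method, URL) pair for starting a workflow invocation
        return ("POST", self._url("workflows", workflow_id, "invocations"))


client = WorkflowClient("https://esc.example.org/api/", token="TOKEN")
method, url = client.run_workflow_request("wf-42")
```

A real client would pass these pairs to an HTTP library along with the authentication token; keeping request construction separate from transport makes the client easy to test offline.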
The federation of clouds can provide benefits for cloud-based applications. Different clouds have different advantages: one might be more reliable, whilst another might be more secure or less expensive. However, selecting the best combination of clouds to meet application requirements is not trivial. This paper presents a novel algorithm to deploy workflow applications on federated clouds. Firstly, we introduce an entropy-based method to quantify the most reliable workflow deployments. Secondly, we apply an extension of the Bell-LaPadula multi-level security model to meet application security requirements. Finally, we optimise deployment in terms of its entropy and also its monetary cost, taking into account the price of computing power, data storage and inter-cloud communication. To evaluate the new algorithm we compared it against two existing scheduling algorithms: the Dynamic Constraint Algorithm (DCA) and Bi-objective Dynamic Level Scheduling (BDLS). We show that our algorithm can find deployments that are of equivalent reliability but are less expensive and also meet security requirements. We have validated our solution using workflows implemented in the e-Science Central cloud-based data analysis system.
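The general shape of such a selection step can be sketched as follows. This is only an illustrative stand-in: the paper's actual entropy measure and cost model are not reproduced here; we use the Shannon entropy of normalized per-task failure probabilities as an uncertainty score and a simple additive cost model (compute + storage + inter-cloud transfer).

```python
import math

# Illustrative sketch only: a stand-in uncertainty score (Shannon entropy
# of normalized per-task failure probabilities) and a toy additive cost
# model, used to pick the cheapest deployment under an entropy bound.

def entropy_score(failure_probs):
    total = sum(failure_probs)
    if total == 0:
        return 0.0
    ps = [p / total for p in failure_probs if p > 0]
    return -sum(p * math.log2(p) for p in ps)

def cost(deployment):
    # Monetary cost: computing power + data storage + inter-cloud communication
    return deployment["compute"] + deployment["storage"] + deployment["transfer"]

def best_deployment(candidates, max_entropy):
    # Cheapest deployment whose uncertainty score stays within the bound
    feasible = [d for d in candidates if entropy_score(d["failures"]) <= max_entropy]
    return min(feasible, key=cost, default=None)

candidates = [
    {"name": "A", "failures": [0.01, 0.01], "compute": 5, "storage": 1, "transfer": 2},
    {"name": "B", "failures": [0.30, 0.01], "compute": 3, "storage": 1, "transfer": 1},
]
```

A full implementation would also filter candidates against the Bell-LaPadula-style security constraints before the cost comparison.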
Quantitative Structure-Activity Relationships (QSAR) is a method for creating models that can predict certain properties of compounds. Because of the importance of QSAR in designing new drugs, the ability to accelerate this process is crucial. One way to achieve this is to quickly explore the QSAR model space in search of the best models. The cloud computing paradigm fits such a scenario very well, so we designed and implemented a tool for exploring the model space using our e-Science Central platform, supported by the cloud. We report on the scalability achieved and the experience gained in designing this system. The acceleration obtained is far beyond what existing QSAR solutions can offer, which opens the potential for new and interesting research in this area.
Workflow is a well-established means by which to capture scientific methods in an abstract graph of interrelated processing tasks. The reproducibility of scientific workflows is therefore fundamental to reproducible e-Science. However, the ability to record all the required details so as to make a workflow fully reproducible is a long-standing problem that is very difficult to solve. In this paper, we introduce an approach that integrates system description, source control, container management and automatic deployment techniques to facilitate workflow reproducibility. We have developed a framework that leverages this integration to support workflow execution, re-execution and reproducibility in the cloud and in a personal computing environment. We demonstrate the effectiveness of our approach by examining various aspects of repeatability and reproducibility on real scientific workflows. The framework allows workflow and task images to be captured automatically, which improves not only repeatability but also runtime performance. It also gives workflows portability across different cloud environments. Finally, the framework can also track changes in the development of tasks and workflows to protect them from unintentional failures.
Dataflow-style workflows offer a simple, high-level programming model for flexible prototyping of scientific applications as an attractive alternative to low-level scripting. At the same time, workflow management systems (WfMS) may support data parallelism over big datasets by providing scalable, distributed deployment and execution of the workflow over a cloud infrastructure. In theory, the combination of these properties makes workflows a natural choice for implementing Big Data processing pipelines, common for instance in bioinformatics. In practice, however, correct workflow design for parallel Big Data problems can be complex and very time-consuming. In this paper we present our experience in porting a genomics data processing pipeline from an existing scripted implementation deployed on a closed HPC cluster to a workflow-based design deployed on the Microsoft Azure public cloud. We draw two contrasting and general conclusions from this project. On the positive side, we show that our solution, based on the e-Science Central WfMS and deployed in the cloud, clearly outperforms the original HPC-based implementation, achieving up to 2.3x speed-up. However, to deliver such performance it was essential to optimise the workflow deployment model to best suit the characteristics of the cloud computing infrastructure. The main reason for the performance gains was the availability of fast, node-local SSD disks delivered by D-series Azure VMs, combined with the implicit use of local disk resources by e-Science Central workflow engines. These conclusions suggest that, on parallel Big Data problems, it is important to couple an understanding of the cloud computing architecture and its software stack with simplicity of design, and that further efforts in automating the parallelisation of complex pipelines are required.
Scientific workflows are increasingly being migrated to the Cloud. However, workflow developers face the problem of which Cloud to choose and, more importantly, how to avoid vendor lock-in. This is because there is a range of Cloud platforms, each with different functionality and interfaces. In this paper we propose a system that allows workflows to be portable across a range of Clouds. This portability is achieved through a new framework for building, dynamically deploying and enacting workflows. It combines the TOSCA specification language and container-based virtualization. TOSCA is used to build a reusable and portable description of a workflow which can be automatically deployed and enacted using Docker containers. We describe a working implementation of our framework and evaluate it using a set of existing scientific workflows that illustrate the flexibility of the proposed approach.
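The enactment step of such a framework can be sketched as follows: given a portable description of a workflow task (a container image plus its inputs), construct the `docker run` command that would execute it on any Docker-capable cloud. The task description format below is an illustrative assumption, not the TOSCA schema used in the paper.

```python
# Hedged sketch of container-based enactment: translate a portable task
# description into a `docker run` invocation. The description format is
# a simplified assumption for illustration, not the paper's TOSCA model.
def docker_run_command(task):
    cmd = ["docker", "run", "--rm"]
    # Mount host data directories into the container
    for host_path, container_path in task.get("volumes", {}).items():
        cmd += ["-v", f"{host_path}:{container_path}"]
    cmd.append(task["image"])    # container image implementing the task
    cmd += task.get("args", [])  # task-specific arguments
    return cmd


task = {
    "image": "example/align-task:1.0",  # hypothetical image name
    "volumes": {"/data/in": "/work/in"},
    "args": ["--input", "/work/in/reads.fastq"],
}
```

Because the task description names only the image and its data bindings, the same description can be enacted unchanged on any cloud that runs Docker, which is the portability property the framework targets.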