Interoperable and scalable data analysis with microservices: applications in metabolomics

Khoonsari, Payam Emami; Moreno, Pablo; Bergmann, Sven; Burman, Joachim; Capuccini, Marco; Carone, Matteo; Cascante, Marta; Atauri, Pedro de; Foguet, Carles; González-Beltrán, Alejandra N.; Hankemeier, Thomas; Haug, Kenneth; He, Sijin; Herman, Stephanie; Johnson, David; Kale, Namrata; Larsson, Anders; Neumann, Steffen; Peters, Kristian; Pireddu, Luca; Rocca-Serra, Philippe; Roger, Pierrick; Rueedi, Rico; Ruttkies, Christoph; Sadawi, Noureddin; Salek, Reza M.; Sansone, Susanna-Assunta; Schober, Daniel; Selivanov, Vitaly A.; Thévenot, Etienne; Vliet, Michael Van; Zanetti, Gianluigi; Steinbeck, Christoph; Kultima, Kim; Spjuth, Ola

doi:10.1093/bioinformatics/btz160

Cited by 26 publications

(27 citation statements)

References 53 publications


“…Apart from the size, each cluster had the same topology: one master node (configured to act as edge), and a 5-to-3 ratio between service nodes and storage nodes. This service-to-storage ratio was shown to provide good performance, in terms of distributed data processing, in our previous study 59. Hence, we started with a cluster setup that included 1 master node, 5 service nodes and 3 storage nodes (8 nodes in total, excluding master) and, by doubling size on each run, we scaled up to 1 master node, 40 service nodes and 24 storage nodes (64 nodes in total, excluding master).…”

Section: Deployment Automation Scalability (mentioning)

confidence: 78%

“…Khoonsari et al 59 used the PhenoMeNal VRE to scale the preprocessing pipeline of MTBLS233, one of the largest metabolomics studies available on the Metabolights repository 68. This is substantially different from the previous benchmarks, as the analysis was composed by several tools chained into a single pipeline, and because the scalability was evaluated over the full workflow.…”

Section: Full Analysis Scaling (mentioning)

confidence: 99%


On-demand virtual research environments using microservices

Capuccini, Larsson, Carone, et al. 2019. PeerJ Computer Science.

Self Cite


The computational demands for scientific applications are continuously increasing. The emergence of cloud computing has enabled on-demand resource allocation. However, relying solely on infrastructure as a service does not achieve the degree of flexibility required by the scientific community. Here we present a microservice-oriented methodology, where scientific applications run in a distributed orchestration platform as software containers, referred to as on-demand, virtual research environments. The methodology is vendor agnostic and we provide an open source implementation that supports the major cloud providers, offering scalable management of scientific pipelines. We demonstrate applicability and scalability of our methodology in life science applications, but the methodology is general and can be applied to other scientific domains.


“…We implemented a computational workflow to process LC-MS data, illustrated in Figure 3, and evaluated how well it can scale on a Kubernetes infrastructure. The workflow has been described thoroughly elsewhere by Khoonsari et al [12]. Briefly, the open source mzML files were first centroided and calibrated using OpenMS [23].…”

Section: Results (mentioning)

confidence: 99%

“…Thanks to containerisation, scientists can package pipelines in an isolated and self-contained manner, to be distributed and run across a wide variety of computing platforms. Examples of projects in which microservices are a cornerstone include the PhenoMeNal project [12] and the EXTraS project [13].…”

Section: Introduction (mentioning)

confidence: 99%

Container-based bioinformatics with Pachyderm

Novella, Khoonsari, Herman, et al. 2018. Preprint.

Self Cite


Motivation: Computational biologists face many challenges related to data size, and they need to manage complicated analyses often including multiple stages and multiple tools, all of which must be deployed to modern infrastructures. To address these challenges and maintain reproducibility of results, researchers need (i) a reliable way to run processing stages in any computational environment, (ii) a well-defined way to orchestrate those processing stages, and (iii) a data management layer that tracks data as it moves through the processing pipeline. Results: Pachyderm is an open-source workflow system and data management framework that fulfills these needs by creating a data pipelining and data versioning layer on top of projects from the container ecosystem, having Kubernetes as the backbone for container orchestration. We adapted Pachyderm and demonstrated its attractive properties in bioinformatics. A Helm Chart was created so that researchers can use Pachyderm in multiple scenarios. The Pachyderm File System was extended to support block storage. A wrapper for initiating Pachyderm on cloud-agnostic virtual infrastructures was created.


“…In PhenoMeNal, we have extended Galaxy, Jupyter, Luigi and Pachyderm in such a way that they can be orchestrated throughout the cloud infrastructure together with the data analysis tools themselves [69]. Six important metabolomics workflows have been fully integrated into PhenoMeNal (Table 2) and more (mzQuality, NMR-BATMAN) are available for testing (Fig.…”

Section: Scientific Workflows (mentioning)

confidence: 99%

PhenoMeNal: Processing and analysis of Metabolomics data in the Cloud

Peters, Bradbury, Bergmann, et al. 2018. Preprint.

Self Cite


Background: Metabolomics is the comprehensive study of a multitude of small molecules to gain insight into an organism's metabolism. The research field is dynamic and expanding with applications across biomedical, biotechnological and many other applied biological domains. Its computationally-intensive nature has driven requirements for open data formats, data repositories and data analysis tools. However, the rapid progress has resulted in a mosaic of independent (and sometimes incompatible) analysis methods that are difficult to connect into a useful and complete data analysis solution. Findings: The PhenoMeNal (Phenome and Metabolome aNalysis) e-infrastructure provides a complete, workflow-oriented, interoperable metabolomics data analysis solution for a modern infrastructure-as-a-service (IaaS) cloud platform. PhenoMeNal seamlessly integrates a wide array of existing open source tools which are tested and packaged as Docker containers through the project's continuous integration process and deployed based on a Kubernetes orchestration framework. It also provides a number of standardized, automated and published analysis workflows in the user interfaces Galaxy, Jupyter, Luigi and Pachyderm. Conclusions: PhenoMeNal constitutes a keystone solution in cloud infrastructures available for metabolomics. It provides scientists with a ready-to-use, workflow-driven, reproducible and shareable data analysis platform harmonizing the software installation and configuration through user-friendly web interfaces. The deployed cloud environments can be dynamically scaled to enable large-scale analyses which are interfaced through standard data formats, versioned, and have been tested for reproducibility and interoperability. The flexible implementation of PhenoMeNal allows easy adaptation of the infrastructure to other application areas and 'omics research domains.

