SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

Schumacher, André; Pireddu, Luca; Niemenmaa, Matti; Kallio, Aleksi; Korpelainen, Eija; Zanetti, Gianluigi; Heljanko, Keijo

doi:10.1093/bioinformatics/btt601

Cited by 86 publications

(41 citation statements)

References 12 publications


“…BioPig includes three modules that can be used in the early phase of NGS data analysis for processing the raw read data files produced by NGS machines. SeqPig [25] is another collection of similar modules to manipulate, analyze and query sequencing datasets. The work by Weiwiorka et al [26] presents analogous analysis tasks implemented on Apache Spark [27].…”

Section: Related Work (mentioning)

confidence: 99%

Data Management for Heterogeneous Genomic Datasets

Ceri, Kaitoua, Masseroli et al. 2017

IEEE/ACM Trans. Comput. Biol. and Bioinf.


Next Generation Sequencing (NGS), a family of technologies for reading DNA and RNA, is changing biological research, and will soon change medical practice, by quickly providing sequencing data and high-level features of numerous individual genomes in different biological and clinical conditions. The availability of millions of whole genome sequences may soon become the biggest and most important "big data" problem of mankind. In this exciting framework, we recently proposed a new paradigm to raise the level of abstraction in NGS data management, by introducing a GenoMetric Query Language (GMQL) and demonstrating its usefulness through several biological query examples. Leveraging on that effort, here we motivate and formalize GMQL operations, especially focusing on the most characteristic and domain-specific ones. Furthermore, we address their efficient implementation and illustrate the architecture of the new software system that we have developed for their execution on big genomic data in a cloud computing environment, providing the evaluation of its performance. The new system implementation is available for download at the GMQL website (http://www.bioinformatics.deib.polimi.it/GMQL/); GMQL can also be tested through a set of predefined queries on ENCODE and Roadmap Epigenomics data at http://www.bioinformatics.deib.polimi.it/GMQL/queries/.


“…As in this case, it might not always be convenient to use Hadoop. The MapReduce concepts have been implemented in many other parallel solutions such as the Genome Analysis Toolkit (GATK) [23]; Hadoop-based set of tools, SeqPig [34]; parallel version of the well-known BLAST and SOM algorithms [35], etc. GATK framework helps the researchers to develop their own tools for the NGS data analysis, overcoming limitations of the existing problem-focused tools or complications of the general frameworks.…”

Section: A Commodity Clusters (mentioning)

confidence: 99%

The role of high performance, grid and cloud computing in high-throughput sequencing

Lightbody, Browne, Zheng et al. 2016

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)


“…This includes a graphical platform to integrate and execute available MapReduce workflows, the possibility to reproduce experiments easily and a simplified access to a MapReduce cluster in private and public clouds. Libraries such as SeqPig [8] or BioPig [9] are helping biologists to use the aforementioned paradigms by abstracting the underlying Hadoop framework and providing high-level Apache Pig functions (or User Defined Functions (UDFs)). Nevertheless, a combination of algorithms to workflows, a standardized way to import/export data and an execution platform for algorithms in public or private cloud infrastructure is still lacking.…”

Section: Bioinformatics Mapreduce Applications (mentioning)

confidence: 99%

Delivering bioinformatics MapReduce applications in the cloud

Forer, Lipić, Schönherr et al. 2014

2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)


The ever-increasing data production and availability in the field of bioinformatics demands a paradigm shift towards the utilization of novel solutions for efficient data storage and processing, such as the MapReduce data parallel programming model and the corresponding Apache Hadoop framework. Despite the evident potential of this model and existence of already available algorithms and applications, especially for batch processing of large data sets as in the Next Generation Sequencing analysis, bioinformatics MapReduce applications are yet to become widely adopted in the bioinformatics data analysis. We identify two prerequisites for their adaptation and utilization: (1) the ability to compose complex workflows from multiple bioinformatics MapReduce tools that will abstract technical details of how those tools are combined and executed allowing bioinformatics domain experts to focus on the analysis, and (2) the availability of accessible and flexible computing infrastructure for this type of data processing. This paper presents integration of two existing systems: Cloudgene, a bioinformatics MapReduce workflow framework, and CloudMan, a cloud manager for delivering application execution environments. Together, they enable delivery of bioinformatics MapReduce applications in the Cloud.
