2013
DOI: 10.1093/bioinformatics/btt601

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

Abstract: Summary: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Ap…

Citations: cited by 86 publications (41 citation statements)
References: 12 publications
“…BioPig includes three modules that can be used in the early phase of NGS data analysis for processing the raw read data files produced by NGS machines. SeqPig [25] is another collection of similar modules to manipulate, analyze and query sequencing datasets. The work by Weiwiorka et al [26] presents analogous analysis tasks implemented on Apache Spark [27].…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…As in this case, it might not always be convenient to use Hadoop. The MapReduce concepts have been implemented in many other parallel solutions such as the Genome Analysis Toolkit (GATK) [23]; Hadoop-based set of tools, SeqPig [34]; parallel version of the well-known BLAST and SOM algorithms [35], etc. GATK framework helps the researchers to develop their own tools for the NGS data analysis, overcoming limitations of the existing problem-focused tools or complications of the general frameworks.…”
Section: A Commodity Clusters
Citation type: mentioning
confidence: 99%
“…This includes a graphical platform to integrate and execute available MapReduce workflows, the possibility to reproduce experiments easily and a simplified access to a MapReduce cluster in private and public clouds. Libraries such as SeqPig [8] or BioPig [9] are helping biologists to use the aforementioned paradigms by abstracting the underlying Hadoop framework and providing high-level Apache Pig functions (or User Defined Functions (UDFs)). Nevertheless, a combination of algorithms to workflows, a standardized way to import/export data and an execution platform for algorithms in public or private cloud infrastructure is still lacking.…”
Section: Bioinformatics MapReduce Applications
Citation type: mentioning
confidence: 99%