Pilot-Jobs (PJ) have become one of the most successful abstractions in distributed computing. In spite of extensive uptake, there does not exist a well-defined, unifying conceptual model of Pilot-Jobs that can be used to define, compare, and contrast PJ implementations. This presents a barrier to extensibility and interoperability. This paper is an attempt to (i) provide a minimal but complete model (P*) of Pilot-Jobs, (ii) establish the generality of the P* Model by mapping various well-known Pilot-Job frameworks such as Condor and DIANE to P*, and (iii) demonstrate the interoperable and concurrent usage of distinct Pilot-Job frameworks on different production distributed cyberinfrastructures via the use of an extensible API for the P* Model (Pilot-API).
SUMMARY: With the emergence of popular next-generation sequencing (NGS)-based genome-wide protocols such as chromatin immunoprecipitation followed by sequencing (ChIP-Seq) and RNA-Seq, there is a growing need for research and infrastructure to support the effective analysis of NGS data. Such research and infrastructure do not replace but complement algorithmic advances in analyzing NGS data. We present a runtime environment, Distributed Application Runtime Environment, that supports the scalable, flexible, and extensible composition of capabilities that cover the primary requirements of NGS-based analytics. In this work, we use BFAST as a representative stand-alone tool for NGS data analysis and a ChIP-Seq pipeline as a representative pipeline-based approach to analyze the computational requirements. We analyze the performance characteristics of BFAST and examine its dependency on different input parameters. The computational complexity of genome-wide mapping using BFAST depends, among other factors, on the size of the reference genome and the data size of the short reads. Characterizing the performance suggests that the mapping benefits from both scaling-up (increased fine-grained parallelism) and scaling-out (task-level parallelism, local and distributed). For certain problem instances, scaling-out can be a more efficient approach than scaling-up. On the basis of investigations using the pipeline for ChIP-Seq, we also discuss the importance of dynamic execution of tasks.
The increasing significance of RNAs in transcriptional and post-transcriptional gene regulation processes has generated considerable interest in the prediction of RNA folding and its sensitivity to environmental factors. We use Boltzmann-weighted sampling to generate RNA secondary structures, which are used to characterize the energy landscape via the distributions of energies and base-pair distances. Depending upon the length of an RNA, the number of sequences investigated, and the sample size of generated structures, generating and analyzing sufficient samples can be computationally challenging. We introduce and develop a lightweight and extensible runtime environment that is effective across a range of RNA sizes and other parameters, as well as over a range of infrastructure, from traditional HPC grids to clouds, without requiring any changes at the application or user level. The Adaptive Distributed Application Management System (ADAMS) is built upon an extensible and interoperable pilot-job and supports the concurrent execution of a broad range of task sizes across a range of infrastructure. We use ADAMS to investigate the folding energy landscape for two RNA systems of different sizes: a set of S-adenosyl methionine (SAM) binding RNA sequences known as SAM-I riboswitches, and the S gene of the Bovine Corona Virus (BCoV) RNA genome, which comprises 4092 nucleotides. Results of the energy and base-pair distance distributions suggest different energy landscapes, implying different folding dynamics. With the results obtained, we demonstrate the possibility of using this protocol to explore the microscopic origins of the reported sequence-dependent variation of binding affinity and gene expression in the two RNA systems.