Abstract: Metagenomics is the study of environmental microbial communities using state-of-the-art genomic tools. Recent advancements in high-throughput technologies have enabled the accumulation of large volumes of metagenomic data that, until a couple of years ago, were deemed impractical to generate. A primary bottleneck, however, is the lack of scalable algorithms and open-source software for large-scale data processing. In this paper, we present the design and implementation of a novel parallel approach to identify protein families from large-scale metagenomic data. Given a set of peptide sequences, we reduce the problem to one of detecting arbitrarily sized dense subgraphs in bipartite graphs. Our approach efficiently parallelizes this task on a distributed-memory machine through a combination of divide-and-conquer and combinatorial pattern-matching heuristic techniques. We present performance and quality results from extensive testing of our implementation on 160K randomly sampled sequences from the CAMERA environmental sequence database using 512 nodes of a BlueGene/L supercomputer.
A doubled haploid population was employed to characterize the dynamic changes of the genetic components involved in rice blast resistance, including main-effect quantitative trait loci (QTLs), epistatic QTLs and QTL-by-environment interactions. The study was carried out at three different developmental stages of rice, using natural infection tests over 2 years. The number of main-effect QTLs, epistatic QTLs and their environmental interactions differed across the various measuring stages. One QTL (d12) on chromosome 12 was detected at all stages, whereas most QTLs were active only at one or two stages in the population. These findings suggest that the unstable expression of most QTLs identified for blast resistance was influenced by the developmental status of the plants, epistatic effects between different loci and the environments in which they were grown. These findings demonstrate the complexity of expression of rice blast resistance and have important implications for durable resistance breeding and map-based cloning of quantitative traits.
The clustering of putative protein/Open Reading Frame (ORF) sequences available from large-scale metagenomics survey projects is a core analytical function that has led to the identification and characterization of novel protein families of environmental microbial communities. The implementation of this function, however, is currently challenged not only by data size but also by data complexity. In this paper, we present a CPU-GPU implementation of a randomized graph clustering heuristic called Shingling, which was originally developed by Gibson et al. Our implementation divides the stages of computation between the CPU and GPU, with the GPU handling the most time-consuming steps. Experimental results on a 2M ocean metagenomics data set obtained from the Sorcerer II Global Ocean Sampling project show that our new implementation achieves a ∼7X speedup over our serial implementation without using asynchronous CPU-GPU communication, with the GPU part alone contributing a ∼374X speedup in the accelerated part. Qualitative evaluation of the 2M data set shows that our method improves the sensitivity of clustering over existing methods and recruits more sequences into the clustering without impacting the overall specificity. As a demonstration of a large-scale run, we were able to cluster a real-world homology graph, containing 11M vertices and 640M edges and constructed from sequences of an ongoing Pacific Ocean metagenomics survey project, in about 94 minutes.
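To give a feel for the Shingling idea named in the abstract above, here is a minimal serial sketch of a single shingling pass using min-wise hashing. It is not the authors' CPU-GPU implementation. The function names (`make_hashes`, `shingles`, `candidate_clusters`), the parameters `s` (shingle size) and `c` (shingles per vertex), and the final union-find grouping step are all assumptions made for illustration. The heuristic as usually described applies a second shingling level before reporting dense subgraphs, which is omitted here.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # 2^31 - 1; vertex ids are assumed to be smaller than this


def make_hashes(c, seed=0):
    """Draw c random linear hash functions h(v) = (a*v + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(c)]


def shingles(neighbors, s, hashes):
    """Return one shingle per hash function: the s neighbors with the
    smallest hash values. Vertices with fewer than s neighbors get none."""
    if len(neighbors) < s:
        return []
    result = []
    for a, b in hashes:
        ranked = sorted(neighbors, key=lambda v: (a * v + b) % PRIME)
        result.append(tuple(sorted(ranked[:s])))
    return result


def candidate_clusters(adjacency, s=2, c=8, seed=0):
    """Group vertices that share at least one shingle (illustrative step)."""
    hashes = make_hashes(c, seed)

    # Map each shingle to the set of vertices that generated it.
    owners = defaultdict(set)
    for vertex, nbrs in adjacency.items():
        for sh in shingles(list(nbrs), s, hashes):
            owners[sh].add(vertex)

    # Union-find over vertices that share a shingle.
    parent = {v: v for v in adjacency}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for members in owners.values():
        members = list(members)
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    clusters = defaultdict(set)
    for v in adjacency:
        clusters[find(v)].add(v)
    return [cl for cl in clusters.values() if len(cl) > 1]


if __name__ == "__main__":
    # Toy graph: vertices 0-2 share one neighborhood, 3-4 share another,
    # and vertex 5 has too few neighbors to produce a shingle.
    toy = {0: {10, 11, 12}, 1: {10, 11, 12}, 2: {10, 11, 12},
           3: {20, 21, 22}, 4: {20, 21, 22}, 5: {30}}
    print(candidate_clusters(toy))
```

Vertices with heavily overlapping neighborhoods are likely to pick the same minimum-hash neighbors and so land in the same group, which is why the pass can surface dense subgraphs without comparing all vertex pairs. In this sketch, larger `c` raises the chance that similar vertices share a shingle, while larger `s` makes each match stricter.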