The complexity of soil bacterial communities has thus far confounded effective measurement. However, with improved analytical methods, we show that the abundance distribution and total diversity can be deciphered. Reanalysis of reassociation kinetics for bacterial community DNA from pristine and metal-polluted soils showed that a power law best described the abundance distributions. More than one million distinct genomes occurred in the pristine soil, exceeding previous estimates by two orders of magnitude. Metal pollution reduced diversity more than 99.9%, revealing the highly toxic effect of metal contamination, especially for rare taxa.
Nucleic acid-based biochemical assays are crucial to modern biology. Key applications, such as detection of bacterial, viral and fungal pathogens, require detailed knowledge of assay sensitivity and specificity to obtain reliable results. Improved methods to predict assay performance are needed for exploiting the exponentially growing amount of DNA sequence data and for reducing the experimental effort required to develop robust detection assays. Toward this goal, we present an algorithm for the calculation of sequence similarity based on DNA thermodynamics. In our approach, search queries consist of one to three oligonucleotide sequences representing either a hybridization probe, a pair of Padlock probes or a pair of PCR primers with an optional TaqMan™ probe (i.e. in silico or ‘virtual’ PCR). Matches are reported if the query and target satisfy both the thermodynamics of the assay (binding at a specified hybridization temperature and/or change in free energy) and the relevant biological constraints (assay sequences binding to the correct target duplex strands in the required orientations). The sensitivity and specificity of our method is evaluated by comparing predicted to known sequence tagged sites in the human genome. Free energy is shown to be a more sensitive and specific match criterion than hybridization temperature.
BackgroundClassification is difficult for shotgun metagenomics data from environments such as soils, where the diversity of sequences is high and where reference sequences from close relatives may not exist. Approaches based on sequence-similarity scores must deal with the confounding effects that inheritance and functional pressures exert on the relation between scores and phylogenetic distance, while approaches based on sequence alignment and tree-building are typically limited to a small fraction of gene families. We describe an approach based on finding one or more exact matches between a read and a precomputed set of peptide 10-mers.ResultsAt even the largest phylogenetic distances, thousands of 10-mer peptide exact matches can be found between pairs of bacterial genomes. Genes that share one or more peptide 10-mers typically have high reciprocal BLAST scores. Among a set of 403 representative bacterial genomes, some 20 million 10-mer peptides were found to be shared. We assign each of these peptides as a signature of a particular node in a phylogenetic reference tree based on the RNA polymerase genes. We classify the phylogeny of a genomic fragment (e.g., read) at the most specific node on the reference tree that is consistent with the phylogeny of observed signature peptides it contains. Using both synthetic data from four newly-sequenced soil-bacterium genomes and ten real soil metagenomics data sets, we demonstrate a sensitivity and specificity comparable to that of the MEGAN metagenomics analysis package using BLASTX against the NR database. Phylogenetic and functional similarity metrics applied to real metagenomics data indicates a signal-to-noise ratio of approximately 400 for distinguishing among environments. Our method assigns ~6.6 Gbp/hr on a single CPU, compared with 25 kbp/hr for methods based on BLASTX against the NR database.ConclusionsClassification by exact matching against a precomputed list of signature peptides provides comparable results to existing techniques for reads longer than about 300 bp and does not degrade severely with shorter reads. Orders of magnitude faster than existing methods, the approach is suitable now for inclusion in analysis pipelines and appears to be extensible in several different directions.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.