As computational biologists are inundated with ever-increasing volumes of metagenomic data, developing analysis approaches that keep pace with the growth of sequence archives remains a challenge. In recent years, the accelerating availability of genomic data has been accompanied by the application of a wide array of highly efficient approaches from other fields to metagenomics. For instance, sketching algorithms such as MinHash have seen rapid and widespread adoption. These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculation. Here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. We also highlight more recent advances to complement previous, broader reviews of these areas. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions.
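To make the sketching idea concrete, the following is a minimal, illustrative MinHash sketch in Python. The k-mer size, number of hash functions, and the seeded-SHA-1 hashing scheme are assumptions chosen for clarity, not a description of any particular tool's implementation; the estimated Jaccard similarity is only approximate and tightens as the signature grows.

```python
import hashlib

def kmers(seq, k=4):
    """Decompose a sequence into its set of overlapping k-mers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(kmer_set, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value observed over all k-mers (the MinHash signature)."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.sha1(f"{seed}:{kmer}".encode()).hexdigest(), 16)
            for kmer in kmer_set
        ))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """The fraction of signature slots where the minima agree is an
    unbiased estimator of the Jaccard similarity of the k-mer sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

a = kmers("ACGTACGTACGTGGTT")
b = kmers("ACGTACGTACGTGGAA")
true_j = len(a & b) / len(a | b)  # exact Jaccard for comparison
est_j = estimate_jaccard(minhash_signature(a), minhash_signature(b))
```

The appeal in metagenomics is that two fixed-size signatures can be compared without revisiting the (potentially enormous) underlying k-mer sets.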
The scalable design of safe guide RNA sequences for CRISPR gene editing depends on the computational "scoring" of DNA locations that may be edited. As there is no widely accepted benchmark dataset for comparing scoring models, we present a curated "TrueOT" dataset containing thoroughly validated datapoints that best reflect the properties of in vivo editing. Many existing models are trained on data from high-throughput assays. We hypothesize that such models may transfer suboptimally to the low-throughput data in TrueOT due to fundamental biological differences between proxy assays and in vivo behavior. We developed new Siamese convolutional neural networks, trained them on a proxy dataset, and compared their performance against existing models on TrueOT. Our simplest model, with a single convolutional and pooling layer, surprisingly exhibits state-of-the-art performance on TrueOT. Adding subsequent layers improves performance on the proxy dataset while compromising performance on TrueOT. We demonstrate that model complexity can only improve performance on TrueOT if transfer learning techniques are employed. These results suggest an urgent need for the CRISPR community to agree upon a benchmark dataset such as TrueOT and highlight that various sources of CRISPR data cannot be assumed to be equivalent. Our codebase and datasets are available on GitHub at github.com/baolab-rice/CRISPR_OT_scoring.
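The Siamese arrangement described above can be sketched with NumPy. This is an illustrative toy, not the paper's model: the weights are random and untrained, and the filter count, filter width, and Euclidean-distance head are assumptions. It shows only the structural idea of a single shared convolution-plus-pooling branch applied to both the guide and the candidate off-target site.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(seq):
    """One-hot encode a DNA sequence as a (length, 4) array."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        out[i, idx[base]] = 1.0
    return out

# Shared branch weights: 8 convolutional filters of width 5 over 4 channels.
W = rng.normal(scale=0.1, size=(8, 5, 4))

def embed(seq):
    """One conv layer + ReLU + global max pooling -> an 8-dim embedding."""
    x = one_hot(seq)
    n_windows = x.shape[0] - W.shape[1] + 1
    conv = np.array([[np.sum(W[f] * x[i:i + 5]) for i in range(n_windows)]
                     for f in range(W.shape[0])])
    conv = np.maximum(conv, 0.0)  # ReLU
    return conv.max(axis=1)       # global max pool over positions

def siamese_distance(guide, off_target):
    """Both inputs pass through the SAME branch (shared weights); the
    distance between embeddings serves as the similarity score."""
    return float(np.linalg.norm(embed(guide) - embed(off_target)))

d_same = siamese_distance("ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGT")
d_diff = siamese_distance("ACGTACGTACGTACGTACGT", "TTTTGGGGCCCCAAAATTTT")
```

Weight sharing is the key property: identical inputs are guaranteed identical embeddings, so the network learns a sequence-pair geometry rather than two independent representations.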
Microfluidics can split samples into thousands or millions of partitions such as droplets or nanowells. Partitions capture analytes according to a Poisson distribution, and in diagnostics, the analyte concentration is commonly calculated with a closed-form solution via maximum likelihood estimation (MLE). Here, we present a generalization of MLE with microfluidics, an extension of our previously developed Sparse Poisson Recovery (SPoRe) algorithm, and an in vitro demonstration with droplet digital PCR (ddPCR) of the new capabilities that SPoRe enables. Many applications such as infection diagnostics require sensitive detection and broad-range multiplexing. Digital PCR coupled with conventional target-specific sensors yields the former but is constrained in multiplexing by the number of available measurement channels (e.g., fluorescence). In our demonstration, we circumvent these limitations by broadly amplifying bacteria with 16S ddPCR and assigning barcodes to nine pathogen genera using only five nonspecific probes. Moreover, we measure only two probes at a time in multiple groups of droplets given our two-channel ddPCR system. Although individual droplets are ambiguous in their bacterial content, our results show that the concentrations of bacteria in the sample can be uniquely recovered given the pooled distribution of partition measurements from all groups. We ultimately achieve stable quantification down to approximately 200 total copies of the 16S gene per sample, enabling a suite of clinical applications given a robust upstream microbial DNA extraction procedure. We develop new theory that generalizes the application of this framework to a broad class of realistic sensors and applications, and we prove scaling rules for system design to achieve further expanded multiplexing. This flexibility means that the core principles and capabilities demonstrated here can generalize to most biosensing applications with microfluidic partitioning.
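The closed-form MLE mentioned above follows from Poisson statistics: if partitions capture copies at rate λ, the probability of an empty partition is e^(-λ), so λ̂ = -ln(fraction negative). A minimal sketch, with illustrative droplet counts and volume (the ~0.85 nL droplet volume and counts below are assumptions, not values from the paper):

```python
import math

def poisson_mle_concentration(n_total, n_negative, partition_volume_ul):
    """Closed-form MLE for concentration from digital PCR counts.

    The fraction of negative (empty) partitions estimates e^(-lambda),
    giving lambda = -ln(n_negative / n_total) copies per partition,
    which is then converted to copies per microliter.
    """
    if n_negative == 0 or n_negative == n_total:
        raise ValueError("MLE undefined when all partitions are positive or all are empty")
    lam = -math.log(n_negative / n_total)  # mean copies per partition
    return lam / partition_volume_ul       # copies per microliter

# Illustrative run: 20,000 droplets of ~0.85 nL each, 5,000 of them negative.
conc = poisson_mle_concentration(20_000, 5_000, 0.85e-3)
```

SPoRe generalizes beyond this single-analyte closed form: rather than thresholding each partition as positive or negative for one target, it recovers a sparse vector of analyte concentrations from the pooled distribution of ambiguous partition measurements.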