The significance of building atlases of human cells as references for future biological and medical studies of humans in health and disease has been well recognized. Compared with the rapid accumulation of single-cell data, there has been less published work on the information structure needed to assemble cell atlases, or on methods for using reference atlases once they are ready. Most existing cell atlas efforts organize single-cell gene expression data as a collection of individual files, allowing users to download selected data sheets or to annotate query cells using models pretrained on the collected data. These features support the basic uses of cell atlases. More comprehensive uses of global cell atlases can be developed once data of cells from multiple organs across different studies are assembled into one orchestrated data repository rather than a collection of data files. For this purpose, we presented a unified giant table (uGT) to store and organize single-cell data from multiple studies in a single large data repository, and a unified hierarchical annotation framework (uHAF) to annotate cells from uncoordinated studies. Based on these technologies, we developed a system that enables users to design complex rules to recruit from the atlas cells that meet certain conditions, such as a desired expression range of one or more genes and a required organ, tissue, or developmental origin, across multiple datasets that were otherwise unconnected. The conditions can be expressed as sophisticated logic criteria to pinpoint specific cells that cannot be easily identified by traditional in vivo or in vitro cell sorting or by traditional searches of published data. We call this technology in data cell sorting from cell atlases. As the coverage of the cell atlas increases, this in data experiment paradigm will help scientists conduct investigations in the data space beyond the restrictions of traditional in vivo and in vitro experiments.
In the current work, we collected scRNA-seq data of more than 1 million human cells from scattered studies and assembled them into a human Ensemble Cell Atlas (hECA) using the proposed information structure, and provided comprehensive tools for in data experiments based on the atlas. Case examples on the agile construction of atlases of particular cell types and on studying the side effects of targeted immune therapy drugs showed that in data cell sorting is an efficient and effective route to comprehensive discoveries. hECA provides a powerful platform for assembling massive scattered single-cell data into a unified atlas, and can serve as a prototype for building future cell atlases.
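To make the idea of in data cell sorting concrete, the following is a minimal sketch of composing logical selection rules over a unified table of cells. It is not the hECA interface: the `Cell` record, the rule helpers (`expr_between`, `organ_in`, `all_of`, `negate`), the gene names, and all expression values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Cell:
    """One row of a toy unified table (hypothetical schema, not uGT itself)."""
    cell_id: str
    study: str
    organ: str
    cell_type: str
    expression: Dict[str, float]  # gene symbol -> normalized expression


Rule = Callable[[Cell], bool]


def expr_between(gene: str, low: float, high: float = float("inf")) -> Rule:
    """Cells whose expression of `gene` lies in [low, high]; missing genes count as 0."""
    return lambda c: low <= c.expression.get(gene, 0.0) <= high


def organ_in(*organs: str) -> Rule:
    allowed = set(organs)
    return lambda c: c.organ in allowed


def all_of(*rules: Rule) -> Rule:
    return lambda c: all(r(c) for r in rules)


def negate(rule: Rule) -> Rule:
    return lambda c: not rule(c)


def sort_cells(atlas: List[Cell], rule: Rule) -> List[Cell]:
    """Recruit every cell in the atlas that satisfies the rule."""
    return [c for c in atlas if rule(c)]


# Toy atlas pooled from three unrelated studies (made-up values).
atlas = [
    Cell("c1", "studyA", "Blood", "B cell", {"CD19": 2.5, "MS4A1": 1.8}),
    Cell("c2", "studyA", "Blood", "T cell", {"CD3E": 3.0}),
    Cell("c3", "studyB", "Brain", "Pericyte", {"CD19": 1.2, "PDGFRB": 2.0}),
    Cell("c4", "studyB", "Brain", "Neuron", {"SNAP25": 4.0}),
    Cell("c5", "studyC", "Lung", "B cell", {"CD19": 2.1}),
]

# Rule: CD19 expression of at least 1.0 AND organ is not blood.
rule = all_of(expr_between("CD19", 1.0), negate(organ_in("Blood")))
selected = sort_cells(atlas, rule)
selected_ids = [c.cell_id for c in selected]
print(selected_ids)
```

Because the rules are ordinary predicates, they can be nested to arbitrary depth and applied across studies in one pass, which is the property that a collection of separate data files does not offer directly.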
scCRISPR-seq is an emerging high-throughput CRISPR screening technology that combines CRISPR screening with single-cell sequencing technologies. It provides rich information on gene regulation. When scCRISPR-seq is performed in a population of heterogeneous cells, the observed cellular response in perturbed cells may be caused not only by the perturbation, but also by the infection bias of guide RNAs (gRNAs), which arises mainly from intrinsic differences between cell clusters. The mixing of these effects confounds gene regulation studies. We developed scDecouple to decouple the true cellular response to the perturbation from the influence of infection bias. It models the distribution of perturbed cells and iteratively finds the maximum likelihood estimates of the cell cluster proportions as well as the real cellular response for each gRNA. We demonstrated its performance in a series of simulation experiments. By applying scDecouple to real CROP-seq data, we found that scDecouple could enhance biological discovery by detecting perturbation-related genes more precisely. It helps to better study gene function and identify disease targets via scCRISPR-seq, especially with heterogeneous samples or complex gRNA libraries.
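The iterative maximum likelihood idea above can be illustrated with a toy expectation-maximization (EM) loop. This is a simplified sketch, not the scDecouple model: it assumes one-dimensional expression, cluster means and a common standard deviation already known from control cells, and a single additive shift shared by all clusters. The function `decouple` and all simulation parameters are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def decouple(perturbed, cluster_means, sd, n_iter=50):
    """EM for a mixture whose components are the control clusters moved by one
    shared shift. Returns (cluster proportions, shift) for the perturbed cells."""
    k = len(cluster_means)
    n = len(perturbed)
    props = [1.0 / k] * k
    shift = 0.0
    for _ in range(n_iter):
        # E-step: posterior probability that each perturbed cell came from each cluster.
        resp = []
        for x in perturbed:
            w = [props[j] * normal_pdf(x, cluster_means[j] + shift, sd) for j in range(k)]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M-step: closed-form updates of the proportions and the shared shift.
        props = [sum(r[j] for r in resp) / n for j in range(k)]
        shift = sum(
            r[j] * (x - cluster_means[j])
            for x, r in zip(perturbed, resp)
            for j in range(k)
        ) / n
    return props, shift


# Simulated data (all values are made up for illustration).
random.seed(0)
cluster_means = [0.0, 5.0]   # assumed known from control cells
sd = 1.0
true_shift = 1.0             # real effect of the perturbation
true_props = [0.8, 0.2]      # infection bias: the gRNA favours cluster 0

# Control cells are split evenly between the two clusters.
control = [random.gauss(m, sd) for m in cluster_means for _ in range(1000)]
# Perturbed cells are drawn with biased proportions and shifted by the true effect.
perturbed = []
for _ in range(2000):
    j = 0 if random.random() < true_props[0] else 1
    perturbed.append(random.gauss(cluster_means[j] + true_shift, sd))

# Naive estimate: difference of population means, which mixes both effects.
naive_shift = sum(perturbed) / len(perturbed) - sum(control) / len(control)
est_props, est_shift = decouple(perturbed, cluster_means, sd)
print("naive shift:", round(naive_shift, 2))
print("decoupled shift:", round(est_shift, 2), "proportions:", [round(p, 2) for p in est_props])
```

In this simulation the naive difference of means is pulled away from the true effect because the perturbed population is enriched for one cluster, whereas estimating the proportions and the shift jointly separates the two contributions.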
Chromatin accessibility profiling methods such as assay for transposase-accessible chromatin using sequencing (ATAC-seq) have advanced the identification of gene regulatory elements and the characterization of epigenetic landscapes. Unlike gene expression data, chromatin accessibility data have no consistent reference, which hinders large-scale integrative analysis. By analyzing more than 1000 ATAC-seq samples and 100 scATAC-seq samples, we found that cells share the same set of potential open regions. We thus proposed a reference called consensus peaks (cPeaks) to represent open regions across different cell types, and developed a deep-learning model to predict all potential open regions in the human genome. We showed that cPeaks can be regarded as a new set of epigenomic elements in the human genome, and that using cPeaks can improve the performance of cell annotation and facilitate the discovery of rare cell types. cPeaks also performed well in analyzing dynamic biological processes and diseases. cPeaks can serve as a general reference for epigenetic studies, much like the reference genome for genomic studies, making research faster, more accurate, and more scalable.
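One simple way to see why a shared peak reference helps integration is to merge overlapping peaks from several samples into a single coordinate set and then describe every sample against it. The sketch below does only that; it is an assumption-laden illustration rather than the cPeaks construction procedure, and the functions `merge_peaks` and `accessibility_vector` as well as all coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

Peak = Tuple[str, int, int]  # (chromosome, start, end), half-open interval


def merge_peaks(samples: List[List[Peak]]) -> List[Peak]:
    """Union of all peaks across samples, with overlapping or touching peaks merged."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = {}
    for peaks in samples:
        for chrom, start, end in peaks:
            by_chrom.setdefault(chrom, []).append((start, end))

    merged: List[Peak] = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Extend the current region when the next peak overlaps or touches it.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged


def accessibility_vector(peaks: List[Peak], reference: List[Peak]) -> List[int]:
    """1 for each reference region overlapped by at least one peak of the sample, else 0."""
    vector = []
    for r_chrom, r_start, r_end in reference:
        hit = any(
            chrom == r_chrom and start < r_end and r_start < end
            for chrom, start, end in peaks
        )
        vector.append(1 if hit else 0)
    return vector


# Toy peak calls from three samples (made-up coordinates).
sample_a = [("chr1", 100, 200), ("chr1", 500, 600)]
sample_b = [("chr1", 150, 250), ("chr2", 10, 50)]
sample_c = [("chr1", 590, 700)]

reference = merge_peaks([sample_a, sample_b, sample_c])
vectors = [accessibility_vector(s, reference) for s in (sample_a, sample_b, sample_c)]
print(reference)
print(vectors)
```

Once every sample is expressed over the same reference regions, the resulting vectors share one feature space and can be compared or stacked directly, which is not possible when each sample keeps its own peak coordinates.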