2023
DOI: 10.1101/2023.09.11.557287
Preprint

Cell2Sentence: Teaching Large Language Models the Language of Biology

Daniel Levine,
Sacha Lévy,
Syed Asad Rizvi
et al.

Abstract: Large language models like GPT have shown impressive performance on natural language tasks. Here, we present a novel method to directly adapt these pretrained models to a biological context, specifically single-cell transcriptomics, by representing gene expression data as text. Our Cell2Sentence approach converts each cell's gene expression profile into a sequence of gene names ordered by expression level. We show that these gene sequences, which we term "cell sentences", can be used to fine-tune causal langua…

Cited by 7 publications (5 citation statements)
References 58 publications
“…GPTCelltype[11] employs GPT-4 to identify cell types based on marker gene information. Cell2Sentence[16], through fine-tuning GPT-2, annotates cells by generating a textual representation for each one, based on the names of the top 100 genes ranked by their transcriptional expression values in the cell. GenePT[5] uses a similar approach for textualizing cells and employs GPT-3.5 to generate embedding vectors, which are then combined with other supervised learning models for various downstream tasks.…”
Section: Related Work (mentioning)
confidence: 99%
“…The mainstream approach for textualizing cells involves sorting all genes within a cell by their transcriptional expression values, retaining the top 100 genes, and sequentially concatenating their names. This method, known as "cell sentence" within Cell2Sentence [16], results in a sentence composed of 100 gene names without a natural linguistic structure, significantly diverging from the corpus LLMs encountered during their pre-training phase. To make the "cell sentence" more sentence-like, we introduced the "cell sentence plus", where each gene name is followed by a brief description of that gene, facilitating the LLMs' understanding of the input's meaning, as illustrated in Figure 1c.…”
Section: Textual Representation of Genes and Cells (mentioning)
confidence: 99%
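The citation statement above describes the textualization procedure concretely: sort a cell's genes by expression, keep the top 100, and concatenate their names. A minimal sketch of that transformation is below; the function and variable names are illustrative, not taken from the Cell2Sentence codebase, and a "cell sentence plus" variant (appending a short description after each gene name) is shown as assumed by the citing paper's description.

```python
import numpy as np

def cell_to_sentence(expression, gene_names, top_k=100):
    """Convert one cell's expression vector into a 'cell sentence':
    gene names ordered by decreasing expression, truncated to top_k.
    Zero-expression genes are dropped."""
    order = np.argsort(expression)[::-1]  # rank genes high -> low
    top = [gene_names[i] for i in order[:top_k] if expression[i] > 0]
    return " ".join(top)

def cell_to_sentence_plus(expression, gene_names, descriptions, top_k=100):
    """'Cell sentence plus' variant: follow each gene name with a brief
    description to make the input more natural-language-like."""
    order = np.argsort(expression)[::-1]
    parts = [f"{gene_names[i]} ({descriptions[gene_names[i]]})"
             for i in order[:top_k] if expression[i] > 0]
    return "; ".join(parts)

# Toy example with 5 genes and top_k=3
genes = ["CD3D", "MS4A1", "NKG7", "GNLY", "LYZ"]
expr = np.array([0.0, 7.2, 1.5, 3.3, 9.8])
print(cell_to_sentence(expr, genes, top_k=3))  # -> "LYZ MS4A1 GNLY"
```

In practice the resulting strings are used as training text for fine-tuning a causal language model, so the rank-order encoding (rather than raw counts) is what the model sees.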
“…For example, Hou and Ji [18] employed ChatGPT for cell type annotation; Wysocki et al [19] probed biomedical information in BioBERT and BioMegatron embeddings; and Ye et al [20] utilized instruction fine-tuning to achieve competitive results on graph data task benchmarks with an LLM. While our paper was under preparation, Levine et al [21] independently embarked on a conceptually related approach to ours, in which each cell is transformed into a sequence of gene names, ranked by expression level and truncated at the top 100 genes. Their paper, however, emphasizes generative tasks: cell type annotation and the generation of new cells conditional on cell type.…”
Section: Related Work (mentioning)
confidence: 99%
“…Compared to prior works that directly query LLMs for biological tasks, our method solely utilizes the input descriptions of each gene (which can be sourced from high-quality databases such as NCBI [24]) and the embedding model of LLMs, which suffers less from problems such as hallucination. While our paper was under preparation, Levine et al [25] independently embarked on a conceptually related approach to ours, in which each cell is transformed into a sequence of gene names, ranked by expression level and truncated at the top 100 genes. Their paper, however, emphasizes generating new cells conditional on cell types.…”
Section: Using Language Models for Cell Biology (mentioning)
confidence: 99%