2023
DOI: 10.1101/2023.10.16.562533
Preprint
GenePT: A Simple But Effective Foundation Model for Genes and Cells Built From ChatGPT

Yiqun Chen,
James Zou

Abstract: There has been significant recent progress in leveraging large-scale gene expression data to develop foundation models for single-cell transcriptomes such as Geneformer [1], scGPT [2], and scBERT [3]. These models infer gene functions and interrelations from the gene expression profiles of millions of cells, which requires extensive data curation and resource-intensive training. Here, we explore a much simpler alternative by leveraging ChatGPT embeddings of genes based on literature. Our proposal, GenePT, uses…

Cited by 18 publications (9 citation statements)
References 54 publications
“…Exploring the contribution of multimodal data to develop a multimodal FM is also a possible track. For example, incorporating text-based biological information [87] or multi-omic data with new tokens may help us further extend the functions of these FMs.…”
Section: Discussion
confidence: 99%
“…Cell2Sentence[16], through fine-tuning GPT-2, annotates cells by generating a textual representation for each one, based on the names of the top 100 genes ranked by their transcriptional expression values in the cell. GenePT[5] uses a similar approach for textualizing cells and employs GPT-3.5 to generate embedding vectors, which are then combined with other supervised learning models for various downstream tasks. The scarcity of efforts in fine-tuning LLMs for gene-related issues is apparent, with only Cell2Sentence being a notable attempt at cell annotation.…”
Section: Related Work
confidence: 99%
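The excerpt above describes the cell "textualization" step shared by Cell2Sentence and GenePT: a cell is represented as a sentence of gene names ranked by expression, which is then passed to an embedding model. A minimal sketch of that ranking step, using illustrative gene names and expression values (not data from either paper), with the embedding call itself left out:

```python
# Sketch of cell textualization as described in the citation statement:
# rank a cell's genes by expression (descending) and join the top-k names
# into a sentence. The resulting string would then be sent to an LLM
# embedding model (e.g. a GPT-3.5-era embedding endpoint), which is
# deliberately omitted here. Gene names/values below are made up.

def cell_to_sentence(expression: dict[str, float], k: int = 100) -> str:
    """Return the names of the top-k genes by expression, space-joined."""
    top_genes = sorted(expression, key=expression.get, reverse=True)[:k]
    return " ".join(top_genes)

# Toy expression profile for a single cell.
cell = {"CD3D": 8.2, "GAPDH": 12.5, "MS4A1": 0.1, "ACTB": 11.0}
print(cell_to_sentence(cell, k=3))  # → GAPDH ACTB CD3D
```

Downstream, GenePT (per the excerpt) embeds such sentences and feeds the vectors to supervised models, while Cell2Sentence instead fine-tunes GPT-2 on them directly.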
“…LLMs can be applied to various scientific research fields with just appropriate textual input (prompt), without the need for additional complex modeling. Nonetheless, the application of LLMs in addressing gene-related issues has been sparse: GenePT[5] utilizes LLMs to generate embedding vectors for genes and cells while Cell2Sentence[16] has fine-tuned GPT-2 for cell annotation tasks. To alleviate this issue, this paper focuses on exploring the performance of LLMs across a spectrum of gene-related problems and evaluates the effectiveness of several mainstream LLMs in these contexts.…”
Section: Introduction
confidence: 99%
“…In recent years, various classes of neural networks have provided robust and customizable frameworks for guided representation learning. Deep generative models can leverage variational inference (Lopez et al, 2018) or pre-training on masked data (Cui et al, 2023; Chen and Zou, 2023; Rosen et al, 2023) to facilitate a variety of downstream tasks. Given the over-parameterized nature of these networks, a large number of samples is required for the adaptation of these models for clinical genomics applications.…”
Section: Introduction
confidence: 99%