Abstract:Motivation Breakthroughs in high-throughput technologies and machine learning methods have enabled the shift towards multi-omics modelling as the preferred means to understand the mechanisms underlying biological processes. Machine learning enables and improves complex disease prognosis in clinical settings. However, most multi-omic studies primarily use transcriptomics and epigenomics due to their over-representation in databases and their early technical maturity compared to others omics. For complex phenoty… Show more
“…However, incorporating multi-omics data and their crosstalk information in a Transformer is very challenging: when processing multi-omics data, the multi-modal features are usually multiplied by tens of thousands of genes, producing an extremely long input that is not acceptable by a common Transformer model (usually <512 words). Meanwhile, certain embedding methods for biological data, such as discretization and linear transformation, were introduced in the previous Transformer models ( Osseni et al 2022 , Cui et al 2023 , Theodoris et al 2023 ), while biological information was largely lost during these kinds of embedding.…”
Motivation
Multi-omics data provide a comprehensive view of gene regulation at multiple levels, which is helpful in achieving accurate diagnosis of complex diseases like cancer. However, conventional integration methods rarely utilize prior biological knowledge and lack interpretability.
Results
To integrate various multi-omics data of tissue and liquid biopsies for disease diagnosis and prognosis, we developed a biological pathway informed Transformer, Pathformer. It embeds multi-omics input with a compacted multi-modal vector and a pathway-based sparse neural network. Pathformer also leverages criss-cross attention mechanism to capture the crosstalk between different pathways and modalities. We first benchmarked Pathformer with 18 comparable methods on multiple cancer datasets, where Pathformer outperformed all the other methods, with an average improvement of 6.3%-14.7% in F1 score for cancer survival prediction, 5.1%-12% for cancer stage prediction, and 8.1%-13.6% for cancer drug response prediction. Subsequently, for cancer prognosis prediction based on tissue multi-omics data, we used a case study to demonstrate the biological interpretability of Pathformer by identifying key pathways and their biological crosstalk. Then, for cancer early diagnosis based on liquid biopsy data, we used plasma and platelet datasets to demonstrate Pathformer’s potential of clinical applications in cancer screening. Moreover, we revealed deregulation of interesting pathways (e.g., scavenger receptor pathway) and their crosstalk in cancer patients’ blood, providing potential candidate targets for cancer microenvironment study.
Availability
Pathformer is implemented and freely available at https://github.com/lulab/Pathformer.
Supplementary information
Supplementary data are available at Bioinformatics online.
“…However, incorporating multi-omics data and their crosstalk information in a Transformer is very challenging: when processing multi-omics data, the multi-modal features are usually multiplied by tens of thousands of genes, producing an extremely long input that is not acceptable by a common Transformer model (usually <512 words). Meanwhile, certain embedding methods for biological data, such as discretization and linear transformation, were introduced in the previous Transformer models ( Osseni et al 2022 , Cui et al 2023 , Theodoris et al 2023 ), while biological information was largely lost during these kinds of embedding.…”
Motivation
Multi-omics data provide a comprehensive view of gene regulation at multiple levels, which is helpful in achieving accurate diagnosis of complex diseases like cancer. However, conventional integration methods rarely utilize prior biological knowledge and lack interpretability.
Results
To integrate various multi-omics data of tissue and liquid biopsies for disease diagnosis and prognosis, we developed a biological pathway informed Transformer, Pathformer. It embeds multi-omics input with a compacted multi-modal vector and a pathway-based sparse neural network. Pathformer also leverages criss-cross attention mechanism to capture the crosstalk between different pathways and modalities. We first benchmarked Pathformer with 18 comparable methods on multiple cancer datasets, where Pathformer outperformed all the other methods, with an average improvement of 6.3%-14.7% in F1 score for cancer survival prediction, 5.1%-12% for cancer stage prediction, and 8.1%-13.6% for cancer drug response prediction. Subsequently, for cancer prognosis prediction based on tissue multi-omics data, we used a case study to demonstrate the biological interpretability of Pathformer by identifying key pathways and their biological crosstalk. Then, for cancer early diagnosis based on liquid biopsy data, we used plasma and platelet datasets to demonstrate Pathformer’s potential of clinical applications in cancer screening. Moreover, we revealed deregulation of interesting pathways (e.g., scavenger receptor pathway) and their crosstalk in cancer patients’ blood, providing potential candidate targets for cancer microenvironment study.
Availability
Pathformer is implemented and freely available at https://github.com/lulab/Pathformer.
Supplementary information
Supplementary data are available at Bioinformatics online.
“…However, incorporating multi-omics data and their crosstalk information in a Transformer is very challenging: when processing multi-omics data, the multi-modal features are usually multiplied by the number of genes (tens of thousands), producing an extremely long input that is not acceptable by a common Transformer model (usually less than 512). Meanwhile, certain embedding methods for biological data, such as discretization and linear transformation, were introduced in the previous Transformer models [17][18][19] , while biological information was largely lost during these kinds of embedding.…”
Section: Introductionmentioning
confidence: 99%
“…The crisscross attention mechanism of Transformer would be very useful to capture the crosstalk information 15 . However, the current embedding methods for biological data in Transformer models [16][17][18] , such as discretization and linear transformation, usually missed the biological information. Meanwhile, multi-modal features were typically multiplied by the number of genes, resulting in long input sequences, which could strain available memory resources or necessitate a feature selection step 18,19 .…”
Multi-modal biological data integration can provide comprehensive views of gene regulation and cell development. However, conventional integration methods rarely utilize prior biological knowledge and lack interpretability. To address these challenges, we developed Pathformer, a biological pathway informed deep learning model based on Transformer with bias to integrate multi-modal data. Pathformer leverages criss-cross attention mechanism to capture crosstalk between different biological pathways and between different modalities (i.e., multi-omics). It also utilizes SHapley Additive Explanation method to reveal key pathways, genes, and regulatory mechanisms. Through benchmark studies on 28 TCGA datasets, we demonstrated the superior performance and interpretability of Pathformer on various cancer classification tasks, compared to other integration models. Furthermore, we applied Pathformer to liquid biopsy multi-modal data integration with high accuracy in cancer diagnosis. Meanwhile, Pathformer revealed interesting molecularly altered pathways in cancer patients’ body fluid, such as ligand binding of scavenger receptors, iron transport, and DAP12 signaling transmission, which are related to extracellular vesicle transport, platelet, and immune response.
Cancer is a term that denotes a group of diseases caused by the abnormal growth of cells that can spread in different parts of the body. According to the World Health Organization (WHO), cancer is the second major cause of death after cardiovascular diseases. Gene expression can play a fundamental role in the early detection of cancer, as it is indicative of the biochemical processes in tissue and cells, as well as the genetic characteristics of an organism. Deoxyribonucleic acid (DNA) microarrays and ribonucleic acid (RNA)-sequencing methods for gene expression data allow quantifying the expression levels of genes and produce valuable data for computational analysis. This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods. Both conventional and deep learning-based approaches are reviewed, with an emphasis on the application of deep learning models due to their comparative advantages for identifying gene patterns that are distinctive for various types of cancers. Relevant works that employ the most commonly used deep neural network architectures are covered, including multi-layer perceptrons, as well as convolutional, recurrent, graph, and transformer networks. This survey also presents an overview of the data collection methods for gene expression analysis and lists important datasets that are commonly used for supervised machine learning for this task. Furthermore, we review pertinent techniques for feature engineering and data preprocessing that are typically used to handle the high dimensionality of gene expression data, caused by a large number of genes present in data samples. The paper concludes with a discussion of future research directions for machine learning-based gene expression analysis for cancer classification.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.