2021
DOI: 10.48550/arxiv.2106.02584
Preprint

Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning

Abstract: We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms…
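
To make the core idea concrete, here is a minimal sketch (in PyTorch, not the authors' implementation) of self-attention applied across the datapoint axis: a batch of N rows is treated as one sequence, so each row's representation can condition on every other row. The class name, layer sizes, and usage below are illustrative assumptions.

# Minimal sketch of self-attention *between datapoints*: the N rows of a
# tabular batch form one sequence, so each datapoint attends to all others.
# Illustrative only; not the paper's reference implementation.
import torch
import torch.nn as nn

class AttentionBetweenDatapoints(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (1, N, d_model) -- a single "sequence" whose tokens are the N datapoints
        h, _ = self.attn(x, x, x)   # every datapoint attends to every other datapoint
        return self.norm(x + h)     # residual connection + layer norm, Transformer-style

# Usage: embed a dataset of N rows with d_feat features, then mix information across rows.
N, d_feat, d_model = 128, 10, 64
rows = torch.randn(N, d_feat)
embedded = nn.Linear(d_feat, d_model)(rows).unsqueeze(0)   # (1, N, d_model)
mixed = AttentionBetweenDatapoints(d_model)(embedded)      # (1, N, d_model)

In the full architecture this between-datapoint step is interleaved with attention between the attributes of each datapoint; the sketch shows only the former.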

Cited by 4 publications (10 citation statements)
References 47 publications (65 reference statements)

“…Deep learning for tabular data. As described by Borisov et al. [2021] in their review of the field, there have been various attempts to make deep learning work on tabular data: data encoding techniques to make tabular data better suited for deep learning [Hancock and Khoshgoftaar, 2020, Yoon et al., 2020], "hybrid methods" to benefit from the flexibility of NNs while keeping the inductive biases of other algorithms like tree-based models [Lay et al., 2018, Popov et al., 2020, Abutbul et al., 2020, Hehn et al., 2019, Tanno et al., 2019, Chen, 2020, Kontschieder et al., 2015, Rodriguez et al., 2019] or Factorization Machines [Guo et al., 2017], tabular-specific transformer architectures [Somepalli et al., 2021, Kossen et al., 2021, Arik and Pfister, 2019, Huang et al., 2020], and various regularization techniques to adapt classical architectures to tabular data [Lounici et al., 2021, Shavitt and Segal, 2018, Kadra et al., 2021a, Fiedler, 2021]. In this paper, we focus on architectures directly inspired by classic deep learning models, in particular Transformers and Multi-Layer Perceptrons (MLPs).…”
Section: Related Work
Mentioning, confidence: 99%
“…Deep learning has enabled tremendous progress for learning on image, language, or even audio datasets. On tabular data, however, the picture is muddier, and ensemble models based on decision trees like XGBoost remain the go-to tool for most practitioners [Sta] and data science competitions [Kossen et al., 2021]. Indeed, deep learning architectures have been crafted to create inductive biases matching invariances and spatial dependencies of the data.…”
Section: Introduction
Mentioning, confidence: 99%
“…We also augment these baselines with zero-shot predictions obtained with the same model used to extract the protein sequence embeddings. Lastly, we include ProteinNPT [Notin et al., 2023], a semi-supervised pseudo-generative architecture which jointly models sequences and labels by performing axial attention [Ho et al., 2019b, Kossen et al., 2022…”
Section: Supervised Benchmarks
Mentioning, confidence: 99%
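
As a rough, hedged illustration of the axial attention mentioned in the statement above (not ProteinNPT's or the cited papers' actual code): attention is applied alternately along the two axes of a (sequences × positions) grid, first within each sequence and then across sequences at each position. All names and sizes are assumptions.

# Hedged sketch of axial attention over a batch of N sequences of length L:
# attend along positions within each sequence, then across sequences at
# each position. Illustrative assumptions only.
import torch
import torch.nn as nn

class AxialAttentionSketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.pos_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.seq_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (N, L, d_model) -- N sequences in the batch, each of length L
        h, _ = self.pos_attn(x, x, x)       # attention along the position axis
        x = x + h
        xt = x.transpose(0, 1)              # (L, N, d_model)
        h, _ = self.seq_attn(xt, xt, xt)    # attention across sequences, per position
        return (xt + h).transpose(0, 1)     # back to (N, L, d_model)

# Usage on a toy batch of embedded sequences (labels could be appended as extra tokens).
out = AxialAttentionSketch()(torch.randn(8, 50, 64))   # (8, 50, 64)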
“…ProteinNPT [Notin et al., 2023] is a semi-supervised non-parametric transformer [Kossen et al., 2022] which learns a joint representation of full batches of labeled sequences. It is trained with a hybrid objective consisting of fitness prediction and masked amino acid reconstruction.…”
Section: Appendix A
Mentioning, confidence: 99%
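
The hybrid objective described above can be illustrated with a short, hedged sketch: a weighted sum of a fitness-regression term and a masked amino-acid reconstruction term. The function name, shapes, and the weight alpha are assumptions, not ProteinNPT's actual implementation.

# Hedged sketch of a hybrid objective: fitness prediction plus masked
# amino-acid reconstruction. Shapes, names, and `alpha` are assumptions.
import torch
import torch.nn.functional as F

def hybrid_loss(fitness_pred, fitness_true, token_logits, token_targets, mask, alpha=0.5):
    # fitness_pred, fitness_true: (B,)  predicted vs. measured fitness per sequence
    # token_logits: (B, L, V)           reconstruction logits over the amino-acid vocabulary
    # token_targets: (B, L) long        original residues; mask: (B, L) bool, True where masked
    fitness_loss = F.mse_loss(fitness_pred, fitness_true)
    recon_loss = F.cross_entropy(token_logits[mask], token_targets[mask])
    return alpha * fitness_loss + (1.0 - alpha) * recon_loss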
“…While this works reasonably well in practice, critical information may be lost in the pooling operation and, since not all residues may be relevant to a given task, we may want to be selective about which ones to consider. In this work, we introduce ProteinNPT (§3), a non-parametric transformer [Kossen et al., 2022] variant which is ideally suited to label-scarce settings through an additional regularizing denoising objective, straightforwardly extends to multi-task optimization settings, and addresses all aforementioned issues. In order to quantify the ability of different models to extrapolate to unseen sequence positions, we devise several cross-validation schemes (§4.1) which we apply to all Deep Mutational Scanning (DMS) assays in the ProteinGym benchmarks [Notin et al., 2022a].…”
Section: Introduction
Mentioning, confidence: 99%