Recent advances in T cell receptor (TCR) repertoire sequencing allow for the characterization of repertoire properties, as well as the frequency and sharing of specific TCRs. However, there is no good measure for the local density of a given TCR. TCRs are often described using their Complementarity Determining Region 3 (CDR3) sequences, V/J usage, and clone size. Here we show that the local repertoire density can be estimated using a combined representation of these components through distance-conserving autoencoders and Kernel Density Estimates (KDE).
We present ELATE, an Encoder-based LocAl Tcr dEnsity, and show that the resulting density of a sample can be used as a novel measure to study repertoire properties. The cross-density between two samples can be used as a similarity matrix to fully characterize samples from the same host. Finally, the same projection, in combination with machine learning algorithms, can be used to predict TCR-peptide binding through the local density of known TCRs binding a specific target.

vectors, representing a CDR3 sequence. Encoder's output (decoder's input): the embedded representation with 30 dimensions. Decoder's output: the reconstructed vector. The autoencoder is trained by minimizing the reconstruction MSE (Mean Squared Error). The classifier is a combination of the encoder (with fixed weights) and 4 more fully connected layers. The classifier is trained by minimizing the BCE (Binary Cross Entropy) loss.
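The two training objectives named in the caption above can be illustrated with a minimal sketch. It is not the authors' implementation: the padding length `MAX_LEN`, the zero-padding scheme, the helper names (`one_hot_cdr3`, `mse`, `bce`) and the example sequence are assumptions made for illustration only. It shows how a CDR3 string could be turned into a flattened one-hot vector, and how the reconstruction MSE and the classification BCE are computed.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MAX_LEN = 30  # assumed padding length; not stated in the caption


def one_hot_cdr3(seq, max_len=MAX_LEN):
    """Flatten a CDR3 string into a zero-padded one-hot vector (assumed encoding)."""
    if len(seq) > max_len:
        raise ValueError("sequence longer than max_len")
    vec = [0.0] * (max_len * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq):
        # One slot of 20 entries per position; set the entry of the observed residue.
        vec[pos * len(AMINO_ACIDS) + AMINO_ACIDS.index(aa)] = 1.0
    return vec


def mse(target, recon):
    """Mean squared error between an input vector and its reconstruction."""
    return sum((t - r) ** 2 for t, r in zip(target, recon)) / len(target)


def bce(labels, probs, eps=1e-12):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


x = one_hot_cdr3("CASSLGQAYEQYF")      # hypothetical example sequence
perfect = mse(x, x)                    # exact reconstruction gives zero loss
blurry = mse(x, [0.5] * len(x))        # an uninformative reconstruction
binding_loss = bce([1, 0], [0.9, 0.1])  # two confident, correct predictions
```

In this sketch, a perfect reconstruction has zero MSE, while a decoder that outputs 0.5 everywhere is penalized equally at every entry. In the setup described by the caption, the BCE term would be applied to the classifier output, with the encoder weights held fixed.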
In all the observed datasets, very few pairs of CDR3 sequences differ from each other by fewer than 2 amino acids (AA) (more precisely, have an edit distance of less than 2). Indeed, when the distribution of edit distances between every pair of sequences from pairs of random samples is calculated, the distributions are consistently normal, with practically no values below 2 (see a
representative example in Fig. 2B). Thus, even an autoencoder (AE) producing 1 or 2 errors would lead to a representation closer to the original CDR3 than to any other sequence in a different sample. We thus report the fraction of sequences successfully reconstructed with 0, 1, or 2 errors (Fig. 2).

Fig 2. A. Edit distance distribution. Histogram of edit distances between all pairs of CDR3 sequences from two samples. Few sequences have 2 or fewer mismatches. B. Encoder's accuracies. Accuracies of the AE for the different datasets with no mismatches (fully matched, red), 1 mismatch (orange), and 2 mismatches (blue). C. Encoder's accuracy combined with distances. Accuracies of AE+DIS for the different datasets. D. Encoder and CDR3 distances correlation. Pairwise Euclidean distances in the set of CDR3 one-hot vectors (Xs) on the x axis and pairwise Euclidean distances in the embedded set (Zs) predicted by the AE on the y axis. The legend contains the Spearman correlation between the axes. E. Encoder's distances correlation when combined with distances. Pairwise Euclidean distances in the set of CDR3 one-hot vec...