CD4+ T cells are critical to fighting pathogens, but a comprehensive analysis of human T cell specificities is hindered by the diversity of HLA alleles (>20,000) and the complexity of many pathogen genomes. We previously described GLIPH, an algorithm to cluster T cell receptors (TCRs) that recognize the same epitope and to predict their HLA restriction, but this method loses efficiency and accuracy when analyzing >10,000 TCRs. Here we describe an improved algorithm, GLIPH2, that can process millions of TCR sequences. We used GLIPH2 to analyze 19,044 unique TCRβ sequences from 58 individuals latently infected with Mycobacterium tuberculosis (Mtb) and to group them according to their specificity. To identify the epitopes targeted by clusters of Mtb-specific T cells, we carried out a screen of 3,724 distinct proteins covering 95% of Mtb protein-coding genes using artificial antigen-presenting cells (aAPCs) and reporter T cells. We found that at least five PPE (Pro-Pro-Glu) proteins are targets for T cell recognition in Mtb.