Counting 𝑘-cliques in a graph is an important problem in graph analysis with many applications such as community detection and graph partitioning. Counting 𝑘-cliques is typically done by traversing search trees starting at each vertex in the graph. Parallelizing 𝑘-clique counting has been well-studied on CPUs and many solutions exist. However, there are no performant solutions for 𝑘-clique counting on GPUs.Parallelizing 𝑘-clique counting on GPUs comes with numerous challenges such as the need for extracting fine-grain multi-level parallelism, sensitivity to load imbalance, and constrained physical memory capacity. While there has been work on related problems such as finding maximal cliques and generalized sub-graph matching on GPUs, 𝑘-clique counting in particular has yet to be explored in depth. In this paper, we present the first parallel GPU solution specialized for the 𝑘-clique counting problem. Our solution supports both graph orientation and pivoting for eliminating redundant clique discovery. It incorporates both vertex-centric and edge-centric parallelization schemes for distributing work across thread blocks, and further partitions work within each thread block to extract fine-grain multi-level parallelism while tolerating load imbalance. It also includes optimizations such as binary encoding of induced sub-graphs and sub-warp partitioning to limit memory consumption and improve the utilization of execution resources.Our evaluation shows that our best GPU implementation outperforms the best state-of-the-art parallel CPU implementation by a geometric mean of 12.39×, 6.21×, and 18.99× for 𝑘 = 4, 7, and 10, respectively. We also perform a detailed evaluation of the trade-offs involved in the choice of parallelization scheme, and the incremental speedup of each optimization to provide an in-depth understanding of the optimization space. The insights from our optimization flow can be useful for optimizing other clique finding and graph mining solutions on GPUs. Our code will be open-sourced to enable further research on GPU parallelization of 𝑘-clique counting and other similar graph mining algorithms.