Programming on a GPU has been made considerably easier by the introduction of virtual memory features, which support common pointer-based semantics between the CPU and the GPU. However, supporting virtual memory on a GPU comes with additional costs and overhead, the largest of which arises from address translation. Because a massive number of threads run concurrently on a GPU, the translation lookaside buffers (TLBs) are oversubscribed most of the time. Our investigation into a diverse set of GPU workloads shows that TLB miss rates can be extremely high (up to 99%), which inevitably leads to significant performance degradation due to long-latency page-table walks. Our profiling of TLB-sensitive workloads reveals a high degree of page sharing across the different cores of a GPU. In many applications, a page can be accessed in temporal proximity by multiple cores following similar memory access patterns. To exploit the inherent sharing present in GPU workloads, we propose Valkyrie, an integrated cooperative TLB prefetching mechanism and inter-L1-TLB probing scheme that can efficiently reduce TLB bottlenecks in GPUs. Our evaluation using a diverse set of GPU workloads reveals that Valkyrie achieves an average speedup of 1.95×, while adding modest hardware overhead.

CCS CONCEPTS
• Computing methodologies → Graphics processors; • Software and its engineering → Virtual memory.